From: "Xie, Huawei"
To: "Michael S. Tsirkin"
Cc: "dev@dpdk.org", "virtualization@lists.linux-foundation.org"
Subject: Re: [dpdk-dev] virtio optimization idea
Date: Thu, 10 Sep 2015 06:32:35 +0000

On 9/9/2015 3:34 PM, Michael S. Tsirkin wrote:
> On Fri, Sep 04, 2015 at 08:25:05AM +0000, Xie, Huawei wrote:
>> Hi:
>>
>> Recently I have done a virtio optimization proof of concept. The
>> optimization consists of two parts:
>> 1) avail ring set with fixed descriptors
>> 2) RX vectorization
>> With these optimizations, we see a several-fold performance boost for
>> pure vhost-virtio throughput.
> Thanks!
> I'm very happy to see people work on the virtio ring format
> optimizations.
>
> I think it's best to analyze each optimization separately,
> unless you see a reason why they would only give benefit when applied
> together.

Agreed. I will split the patch so we can see each change's benefit. Of
course, it is also quite common for two changes together to give much
more gain than the sum of their individual gains.

> Also ideally, we'd need a unit test to show the performance impact.
> We've been using the tests in tools/virtio/ under linux,
> feel free to enhance these to simulate more workloads, or
> to suggest something else entirely.

If possible, we will provide a perf test case under tools/virtio. I am
also interested in whether the optimization helps the kernel virtio-net
driver's performance; if it doesn't, something else is the bottleneck
there.

>> The avail ring is initialized with fixed descriptors and is never
>> changed, i.e., the index value of the nth avail ring entry is always
>> n, which means the virtio PMD only ever refills the desc ring, without
>> having to change the avail ring.
>> When vhost fetches the avail ring, unless evicted, it is always in its
>> first-level cache.
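To make the fixed layout concrete, here is a minimal sketch with
placeholder struct definitions and invented function names (not the
actual PMD code) of pre-formatting the avail ring once so that the
driver afterwards touches only the desc ring:

#include <stdint.h>

#define VRING_DESC_F_WRITE 2	/* standard virtio flag: device-writable */

struct vring_desc  { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[]; };

/* Pre-format the avail ring once: entry n always carries desc index n.
 * After this, the driver never writes avail->ring[] again. */
static void avail_ring_init_fixed(struct vring_avail *avail, uint16_t num)
{
	uint16_t i;

	for (i = 0; i < num; i++)
		avail->ring[i] = i;
}

/* RX refill then touches only the descriptor table; a buffer is
 * republished by rewriting desc[n] and bumping avail->idx (the memory
 * barrier before the idx store is omitted for brevity). */
static void rx_refill_desc(struct vring_desc *desc, uint16_t n,
			   uint64_t buf_addr, uint32_t buf_len)
{
	desc[n].addr  = buf_addr;
	desc[n].len   = buf_len;
	desc[n].flags = VRING_DESC_F_WRITE;
	desc[n].next  = 0;	/* no chaining in the fixed RX layout */
}

Since avail->ring[n] == n forever after setup, the host-visible avail
ring cache lines are written exactly once, at initialization time.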
>> When RX receives packets from the used ring, we use used->idx as the
>> desc idx. This requires that vhost processes and returns descs from
>> the avail ring to the used ring in order, which is true for both the
>> current dpdk vhost and the kernel vhost implementations. In my
>> understanding, there is no necessity for vhost-net to process
>> descriptors out of order. One case could be zero copy: for example,
>> if one descriptor doesn't meet the zero-copy requirement, we could
>> return it to the used ring directly, earlier than the descriptors in
>> front of it.
>> To enforce this, I want to use a reserved bit to indicate in-order
>> processing of descriptors.
> So what's the point in changing the idx for the used ring?
> You need to communicate the length to the guest anyway, don't you?

For the guest virtio driver, we use only the length field in the used
ring entry and ignore the index in the entry. Instead, we use
used->idx & 255 as the desc idx for RX, and used->idx & 127 as the desc
idx for TX.
For the vhost driver, since it cannot assume a fixed ring layout (it
must support legacy virtio), it still needs to update the idx in the
used ring entry.

>> For the tx ring, the arrangement is like below. Each transmitted mbuf
>> needs a desc for the virtio_net_hdr, so we actually have only 128
>> free slots.
> Just fix this one. Support ANY_LAYOUT and then you can put data
> linearly. And/or support INDIRECT_DESC and then you can
> use an indirect descriptor.

Will check those two features.

>>
> This one came out corrupted.

Actually I already replied to the original mail and fixed it; copying
it here again.

   +-----+-----+-----+--------------+------+------+------+
   |  0  |  1  | ... | 127  || 128  | 129  | ...  | 255  |  avail ring
   +--+--+--+--+-----+--+---+---+---+--+---+------+--+---+
      |     |           |       |      |             |
      v     v           v       v      v             v
   +--+--+--+--+-----+--+---+---+---+--+---+------+--+---+
   | 128 | 129 | ... | 255  || 128  | 129  | ...  | 255  |  desc ring for virtio_net_hdr
   +--+--+--+--+-----+--+---+---+---+--+---+------+--+---+
      |     |           |       |      |             |
      v     v           v       v      v             v
   +--+--+--+--+-----+--+---+---+---+--+---+------+--+---+
   |  0  |  1  | ... | 127  ||  0   |  1   | ...  | 127  |  desc ring for tx data
   +-----+-----+-----+------+------+------+------+-------+

>> /huawei
>
> Please Cc virtio-related discussion more widely.
> I added the virtualization mailing list.
>
> So what you want to do is avoid changing the avail
> ring; isn't it enough to pre-format it and cache
> the values in the guest?
>
> Host can then keep using the avail ring without changes; it will stay
> in cache.
> Something like the below for the guest should do the trick (untested):

Good optimization to tackle the cache-line transfer issue.
Your change avoids rewriting an avail ring entry when it already points
to the same head index. It could improve the kernel virtio-net driver's
performance whenever the avail ring stays stable at runtime.
We will also investigate applying this idea to the DPDK virtio PMD slow
path.
For the optimization I did, which I call the fast path, the driver
"knows" the avail ring will never change, so it neither a) allocates
and frees descriptors nor b) touches the avail ring at all.
Based on the fact that the DPDK PMD actually uses the simple desc ring
below, we can directly map avail->idx and used->idx to a desc idx and
use vector instructions to process descriptors in parallel, as sketched
after the diagram.

   +-----+-----+-----+------------------+------+
   |  0  |  1  |  2  |  ...      | 254  | 255  |  rx desc ring
   +-----+-----+-----+------------------+------+
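To illustrate that mapping, here is a hedged sketch of the fast-path RX
dequeue (struct layouts abbreviated and function names invented for the
example, not the actual PMD code). Because vhost completes descriptors
in order, used->idx alone identifies the completed slots and the id
field of the used elements is never read:

#include <stdint.h>

struct vring_used_elem { uint32_t id; uint32_t len; };
struct vring_used {
	uint16_t flags;
	uint16_t idx;
	struct vring_used_elem ring[];
};

#define RX_RING_SIZE 256	/* the 256-entry rx desc ring shown above */

/* Dequeue up to 'max' completed RX buffers; returns how many were
 * taken. 'last_used' is the driver's running shadow of the used index. */
static uint16_t rx_dequeue_fast(const struct vring_used *used,
				uint16_t *last_used, uint32_t *lens,
				uint16_t max)
{
	uint16_t ready = (uint16_t)(used->idx - *last_used); /* wraps safely */
	uint16_t n = ready < max ? ready : max;
	uint16_t i;

	for (i = 0; i < n; i++) {
		/* In-order completion: the desc idx is the running counter
		 * masked to the ring size (used->idx & 255 for RX), so
		 * used->ring[slot].id is ignored. */
		uint16_t slot = (uint16_t)(*last_used + i) & (RX_RING_SIZE - 1);
		lens[i] = used->ring[slot].len;	/* only the length is consumed */
	}
	*last_used = (uint16_t)(*last_used + n);
	return n;
}

The fixed stride of 'slot' is what makes this loop straightforward to
vectorize with SSE/AVX in the PMD fast path.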
> Signed-off-by: Michael S. Tsirkin
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 096b857..9363b50 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -91,6 +91,7 @@ struct vring_virtqueue {
>  	bool last_add_time_valid;
>  	ktime_t last_add_time;
>  #endif
> +	u16 *avail;
>
>  	/* Tokens for callbacks. */
>  	void *data[];
> @@ -236,7 +237,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	/* Put entry in available array (but don't update avail->idx until they
>  	 * do sync). */
>  	avail = virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx) & (vq->vring.num - 1);
> -	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> +	if (vq->avail[avail] != head) {
> +		vq->avail[avail] = head;
> +		vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> +	}
>
>  	/* Descriptors and available array need to be set before we expose the
>  	 * new available array entries. */
> @@ -724,6 +728,11 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
>  	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
>  	if (!vq)
>  		return NULL;
> +	vq->avail = kzalloc(sizeof(*vq->avail) * num, GFP_KERNEL);
> +	if (!vq->avail) {
> +		kfree(vq);
> +		return NULL;
> +	}
>
>  	vring_init(&vq->vring, num, pages, vring_align);
>  	vq->vq.callback = callback;
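One detail the quoted hunks leave out is freeing the shadow array. A
sketch of the matching teardown, assuming it goes into
vring_del_virtqueue() in the same file (the real function does more
than is shown here):

/* Sketch only: release the shadow avail cache together with the
 * virtqueue; other teardown steps of the real function are elided. */
void vring_del_virtqueue(struct virtqueue *_vq)
{
	struct vring_virtqueue *vq = to_vvq(_vq);

	kfree(vq->avail);	/* shadow of vring.avail->ring[] */
	kfree(vq);
}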