From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 12 Dec 2018 17:34:31 +0100
From: Maxime Coquelin
To: Ilya Maximets, dev@dpdk.org, tiwei.bie@intel.com, zhihong.wang@intel.com,
 jfreimann@redhat.com, mst@redhat.com
References: <20181212082403.12002-1-maxime.coquelin@redhat.com>
Subject: Re: [dpdk-dev] [PATCH] vhost: batch used descriptors chains
 write-back with packed ring
List-Id: DPDK patches and discussions

Hi Ilya,

On 12/12/18 4:23 PM, Ilya Maximets wrote:
> On 12.12.2018 11:24, Maxime Coquelin wrote:
>> Instead of writing back descriptor chains in order, let's
>> write the first chain's flags last in order to improve batching.
>>
>> With the kernel's pktgen benchmark, a ~3% performance gain is measured.
>>
>> Signed-off-by: Maxime Coquelin
>> ---
>>  lib/librte_vhost/virtio_net.c | 39 +++++++++++++++++++++--------------
>>  1 file changed, 24 insertions(+), 15 deletions(-)
>>
>
> Hi.
> I made some rough testing on my ARMv8 system with this patch and v1 of it.
> Here is the performance difference with current master:
>   v1: +1.1 %
>   v2: -3.6 %
>
> So, write barriers are quite heavy in practice.

Thanks for testing it on ARM. Indeed, SMP WMB is heavier on ARM.

To reduce the number of WMBs, I propose to revert back to the original
implementation: first write all the .len and .id fields, do the write
barrier, then write all the flags, writing the first one last:

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..58a277c53 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -136,6 +136,8 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
 {
 	int i;
 	uint16_t used_idx = vq->last_used_idx;
+	uint16_t head_idx = vq->last_used_idx;
+	uint16_t head_flags = 0;
 
 	/* Split loop in two to save memory barriers */
 	for (i = 0; i < vq->shadow_used_idx; i++) {
@@ -165,12 +167,17 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
 			flags &= ~VRING_DESC_F_AVAIL;
 		}
 
-		vq->desc_packed[vq->last_used_idx].flags = flags;
+		if (i > 0) {
+			vq->desc_packed[vq->last_used_idx].flags = flags;
 
-		vhost_log_cache_used_vring(dev, vq,
+			vhost_log_cache_used_vring(dev, vq,
 					vq->last_used_idx *
 					sizeof(struct vring_packed_desc),
 					sizeof(struct vring_packed_desc));
+		} else {
+			head_idx = vq->last_used_idx;
+			head_flags = flags;
+		}
 
 		vq->last_used_idx += vq->shadow_used_packed[i].count;
 		if (vq->last_used_idx >= vq->size) {
@@ -179,7 +186,13 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
 		}
 	}
 
-	rte_smp_wmb();
+	vq->desc_packed[head_idx].flags = head_flags;
+
+	vhost_log_cache_used_vring(dev, vq,
+				head_idx *
+				sizeof(struct vring_packed_desc),
+				sizeof(struct vring_packed_desc));
+
 	vq->shadow_used_idx = 0;
 	vhost_log_cache_sync(dev, vq);
 }

> My testcase is three instances of testpmd on the same host (with v11 from Jens):
>
> txonly (virtio_user0) --> fwd mode io (vhost0, vhost1) --> rxonly (virtio_user1)
>
> Best regards, Ilya Maximets.
>
>> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
>> index 5e1a1a727..c0b3d1137 100644
>> --- a/lib/librte_vhost/virtio_net.c
>> +++ b/lib/librte_vhost/virtio_net.c
>> @@ -135,19 +135,10 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
>>  			struct vhost_virtqueue *vq)
>>  {
>>  	int i;
>> -	uint16_t used_idx = vq->last_used_idx;
>> +	uint16_t head_flags, head_idx = vq->last_used_idx;
>>  
>> -	/* Split loop in two to save memory barriers */
>> -	for (i = 0; i < vq->shadow_used_idx; i++) {
>> -		vq->desc_packed[used_idx].id = vq->shadow_used_packed[i].id;
>> -		vq->desc_packed[used_idx].len = vq->shadow_used_packed[i].len;
>> -
>> -		used_idx += vq->shadow_used_packed[i].count;
>> -		if (used_idx >= vq->size)
>> -			used_idx -= vq->size;
>> -	}
>> -
>> -	rte_smp_wmb();
>> +	if (unlikely(vq->shadow_used_idx == 0))
>> +		return;
>>  
>>  	for (i = 0; i < vq->shadow_used_idx; i++) {
>>  		uint16_t flags;
>> @@ -165,12 +156,24 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
>>  			flags &= ~VRING_DESC_F_AVAIL;
>>  		}
>>  
>> -		vq->desc_packed[vq->last_used_idx].flags = flags;
>> +		vq->desc_packed[vq->last_used_idx].id =
>> +			vq->shadow_used_packed[i].id;
>> +		vq->desc_packed[vq->last_used_idx].len =
>> +			vq->shadow_used_packed[i].len;
>> +
>> +		rte_smp_wmb();
>>  
>> -		vhost_log_cache_used_vring(dev, vq,
>> +		if (i > 0) {
>> +			vq->desc_packed[vq->last_used_idx].flags = flags;
>> +
>> +			vhost_log_cache_used_vring(dev, vq,
>>  					vq->last_used_idx *
>>  					sizeof(struct vring_packed_desc),
>>  					sizeof(struct vring_packed_desc));
>> +		} else {
>> +			head_idx = vq->last_used_idx;
>> +			head_flags = flags;
>> +		}
>>  
>>  		vq->last_used_idx += vq->shadow_used_packed[i].count;
>>  		if (vq->last_used_idx >= vq->size) {
>> @@ -179,8 +182,14 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
>>  		}
>>  	}
>>  
>> -	rte_smp_wmb();
>> +	vq->desc_packed[head_idx].flags = head_flags;
>>  	vq->shadow_used_idx = 0;
>> +
>> +	vhost_log_cache_used_vring(dev, vq,
>> +				head_idx *
>> +				sizeof(struct vring_packed_desc),
>> +				sizeof(struct vring_packed_desc));
>> +
>>  	vhost_log_cache_sync(dev, vq);
>>  }
>>