From: "Hu, Jiayu"
To: "Liu, Yong"; dev@dpdk.org
CC: maxime.coquelin@redhat.com; "Ye, Xiaolong"; "Wang, Zhihong"
Date: Tue, 17 Mar 2020 09:31:20 +0000
Subject: Re: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
 accelerate Tx operations

Hi Marvin,

Thanks for the comments. Replies are inline.

> -----Original Message-----
> From: Liu, Yong
> Sent: Tuesday, March 17, 2020 3:21 PM
> To: Hu, Jiayu; dev@dpdk.org
> Cc: maxime.coquelin@redhat.com; Ye, Xiaolong; Wang, Zhihong; Hu, Jiayu
> Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> accelerate Tx operations
>
> Hi Jiayu,
> Some comments are inline.
>
> Thanks,
> Marvin
>
> > -----Original Message-----
> > From: dev On Behalf Of Jiayu Hu
> > Sent: Tuesday, March 17, 2020 5:21 PM
> > To: dev@dpdk.org
> > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong; Wang, Zhihong; Hu, Jiayu
> > Subject: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > accelerate Tx operations
> >
> >
> >  int vhost_logtype;
> > @@ -30,8 +34,12 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> >  #define ETH_VHOST_IOMMU_SUPPORT "iommu-support"
> >  #define ETH_VHOST_POSTCOPY_SUPPORT "postcopy-support"
> >  #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
> > +#define ETH_VHOST_DMA_ARG "dmas"
> >  #define VHOST_MAX_PKT_BURST 32
> >
> > +/* ring size of I/OAT */
> > +#define IOAT_RING_SIZE 1024
> > +
>
> Jiayu,
> The configured I/OAT ring size is 1024 here, but I do not see an
> in_flight or nr_batching size check in the enqueue function.
> Is there any possibility that the IOAT ring gets exhausted?

We wait for the IOAT's copy completions when its ring is full. This
guarantees that all enqueue operations to the IOAT can succeed.

> > +struct dma_info_input {
> > +	struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2];
> > +	uint16_t nr;
> > +};
> > +
> > +static inline int
> > +open_dma(const char *key __rte_unused, const char *value,
> > +	 void *extra_args)
> > +{
> > +	struct dma_info_input *dma_info = extra_args;
> > +	char *input = strndup(value, strlen(value) + 1);
> > +	char *addrs = input;
> > +	char *ptrs[2];
> > +	char *start, *end, *substr;
> > +	int64_t qid, vring_id;
> > +	struct rte_ioat_rawdev_config config;
> > +	struct rte_rawdev_info info = { .dev_private = &config };
> > +	char name[32];
> > +	int dev_id;
> > +	int ret = 0;
> > +
> > +	while (isblank(*addrs))
> > +		addrs++;
> > +	if (*addrs == '\0') {
> > +		VHOST_LOG(ERR, "No input DMA addresses\n");
> > +		ret = -1;
> > +		goto out;
> > +	}
> > +
> > +	/* process DMA devices within bracket. */
> > +	addrs++;
> > +	substr = strtok(addrs, ";]");
> > +	if (!substr) {
> > +		VHOST_LOG(ERR, "No input DMA addresses\n");
> > +		ret = -1;
> > +		goto out;
> > +	}
> > +
> > +	do {
> > +		rte_strsplit(substr, strlen(substr), ptrs, 2, '@');
> > +
>
> Function rte_strsplit can fail, so its return value needs to be checked.

Thanks. Will check it later.

>
> > +		start = strstr(ptrs[0], "txq");
> > +		if (start == NULL) {
> > +			VHOST_LOG(ERR, "Illegal queue\n");
> > +			ret = -1;
> > +			goto out;
> > +		}
> > +
> > +		start += 3;
>
> It's better not to use a hardcoded value here.
>
> > +		qid = strtol(start, &end, 0);
> > +		if (end == start) {
> > +			VHOST_LOG(ERR, "No input queue ID\n");
> > +			ret = -1;
> > +			goto out;
> > +		}
> > +
> > +		vring_id = qid * 2 + VIRTIO_RXQ;
> > +		if (rte_pci_addr_parse(ptrs[1],
> > +				       &dma_info->dmas[vring_id].addr) < 0) {
> > +			VHOST_LOG(ERR, "Invalid DMA address %s\n", ptrs[1]);
> > +			ret = -1;
> > +			goto out;
> > +		}
> > +
> > +		rte_pci_device_name(&dma_info->dmas[vring_id].addr,
> > +				    name, sizeof(name));
> > +		dev_id = rte_rawdev_get_dev_id(name);
> > +		if (dev_id == (uint16_t)(-ENODEV) ||
> > +		    dev_id == (uint16_t)(-EINVAL)) {
> > +			VHOST_LOG(ERR, "Cannot find device %s.\n", name);
> > +			ret = -1;
> > +			goto out;
> > +		}
> > +
>
> Multiple queues can't share one IOAT device; a check should be added
> here, as sharing is not allowed.

Currently I only state this restriction in the doc. Will add the check
later.
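A minimal sketch of what that check could look like (illustrative only;
it assumes struct dma_info also keeps the resolved rawdev id in a dev_id
field and marks bound entries with an is_valid flag, neither of which is
shown in the hunk above):

	/* illustrative: reject an IOAT device already bound to another vring */
	uint16_t i;

	for (i = 0; i < RTE_MAX_QUEUES_PER_PORT * 2; i++) {
		if (dma_info->dmas[i].is_valid &&
		    dma_info->dmas[i].dev_id == dev_id) {
			VHOST_LOG(ERR, "%s is already used by vring %u\n",
				  name, i);
			ret = -1;
			goto out;
		}
	}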
> > +
> > +/* notify front-end of enqueued packets */
> > +static __rte_always_inline void
> > +vhost_dma_vring_call(struct pmd_internal *dev, struct dma_vring *dma_vr)
> > +{
> > +	vhost_vring_call_split(dev, dma_vr);
> > +}
> > +
> > +int
> > +free_dma_done(void *dev, void *dma_vr)
> > +{
> > +	uintptr_t flags[255], tmps[255];
>
> Please add a meaningful macro for 255; I am not sure why the limit is
> 255 rather than 256.

The second parameter of rte_ioat_completed_copies() is uint8_t, so the
max value can only be 255. I can replace it with UINT8_MAX later.

>
> > +	int dma_done, i;
> > +	uint16_t used_idx;
> > +	struct pmd_internal *device = dev;
> > +	struct dma_vring *dma_vring = dma_vr;
> > +
> > +	dma_done = rte_ioat_completed_copies(dma_vring->dev_id, 255,
> > +					     flags, tmps);
> > +	if (unlikely(dma_done <= 0))
> > +		return dma_done;
> > +
> > +	dma_vring->nr_inflight -= dma_done;
>
> I am not sure whether the DMA engine returns completions in submission
> order; the mbuf free should happen after the index update is done.

IMO, a pktmbuf can be freed once the IOAT no longer occupies it. We
don't need to wait for the update of the used index.
This is achieved by using the mbuf's refcnt. We increase the mbuf's
refcnt by 1 once we submit its copy job to the IOAT. If it has N copies
that are all offloaded to the IOAT, the refcnt is increased by N. On
completion of an IOAT copy, if it's a pktmbuf copy, we decrease its
refcnt by rte_pktmbuf_free_seg(). When the refcnt reaches 1, which
means all its copies are completed, the mbuf is freed.

>
> > +	for (i = 0; i < dma_done; i++) {
> > +		if ((uint64_t)flags[i] >= dma_vring->max_indices) {
> > +			struct rte_mbuf *pkt = (struct rte_mbuf *)flags[i];
> > +
> > +			/**
> > +			 * the DMA completes a packet copy job, we
> > +			 * decrease the refcnt or free the mbuf segment.
> > +			 */
> > +			rte_pktmbuf_free_seg(pkt);
> > +		} else {
> > +			uint16_t id = flags[i];
> > +
> > +			/**
> > +			 * the DMA completes updating index of the
> > +			 * used ring.
> > +			 */
> > +			used_idx = dma_vring->indices[id].data;
> > +			VHOST_LOG(DEBUG, "The DMA finishes updating index %u "
> > +				  "for the used ring.\n", used_idx);
> > +
> > +			dma_vring->copy_done_used = used_idx;
> > +			vhost_dma_vring_call(device, dma_vring);
> > +			put_used_index(dma_vring->indices,
> > +				       dma_vring->max_indices, id);
> > +		}
> > +	}
> > +	return dma_done;
> > +}
> > +
> > +static __rte_always_inline bool
> > +rxvq_is_mergeable(struct pmd_internal *dev)
> > +{
> > +	return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
> > +}
> > +
>
> I'm not sure whether the shadow used ring can help in the DMA
> acceleration scenario.
> The vhost driver will wait until the DMA copy is done. Optimizing the
> CPU moves may not help overall performance and just adds convoluted
> code.

For small copies, we still use the CPU, as the IOAT is less efficient
for small copies. Therefore, I think we still need the CPU optimization
here.
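To make the refcnt scheme above concrete, here is a minimal sketch of
the submission side (illustrative only; mbuf and dst_iova stand in for
the packet segment and the guest buffer address computed by the enqueue
path):

	/* take one extra reference per copy job so the mbuf stays alive
	 * until the IOAT has finished with it; the mbuf pointer doubles
	 * as the completion handle later returned in flags[].
	 */
	rte_mbuf_refcnt_update(mbuf, 1);
	while (rte_ioat_enqueue_copy(dma_vr->dev_id,
				     rte_pktmbuf_iova(mbuf), dst_iova,
				     rte_pktmbuf_data_len(mbuf),
				     (uintptr_t)mbuf, 0, 0) == 0) {
		/* ring full: reap completions until space frees up */
		free_dma_done(dev, dma_vr);
	}
	dma_vr->nr_inflight++;

	/* on completion (see free_dma_done() above), each returned mbuf
	 * handle gets rte_pktmbuf_free_seg(), which drops the extra
	 * reference and frees the segment once no copies remain.
	 */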
> > +static __rte_always_inline void
> > +flush_shadow_used_ring_split(struct pmd_internal *dev,
> > +			     struct dma_vring *dma_vr)
> > +{
> > +	uint16_t used_idx = dma_vr->last_used_idx & (dma_vr->vr.size - 1);
> > +
> > +	if (used_idx + dma_vr->shadow_used_idx <= dma_vr->vr.size) {
> > +		do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > +						dma_vr->shadow_used_idx);
> > +	} else {
> > +		uint16_t size;
> > +
> > +		/* update used ring interval [used_idx, vr->size] */
> > +		size = dma_vr->vr.size - used_idx;
> > +		do_flush_shadow_used_ring_split(dma_vr, used_idx, 0, size);
> > +
> > +		/* update the left half used ring interval [0, left_size] */
> > +		do_flush_shadow_used_ring_split(dma_vr, 0, size,
> > +						dma_vr->shadow_used_idx -
> > +						size);
> > +	}
> > +	dma_vr->last_used_idx += dma_vr->shadow_used_idx;
> > +
> > +	rte_smp_wmb();
> > +
> > +	if (dma_vr->nr_inflight > 0) {
> > +		struct ring_index *index;
> > +
> > +		index = get_empty_index(dma_vr->indices,
> > +					dma_vr->max_indices);
> > +		index->data = dma_vr->last_used_idx;
> > +		while (unlikely(rte_ioat_enqueue_copy(dma_vr->dev_id,
> > +						      index->pa,
> > +						      dma_vr->used_idx_hpa,
> > +						      sizeof(uint16_t),
> > +						      index->idx, 0, 0) ==
> > +				0)) {
> > +			int ret;
> > +
> > +			do {
> > +				ret = dma_vr->dma_done_fn(dev, dma_vr);
> > +			} while (ret <= 0);
> > +		}
> > +		dma_vr->nr_batching++;
> > +		dma_vr->nr_inflight++;
> > +	} else {
> > +		/**
> > +		 * we update index of used ring when all previous copy
> > +		 * jobs are completed.
> > +		 *
> > +		 * When enabling DMA copy, if there are outstanding copy
> > +		 * jobs of the DMA, to avoid the DMA overwriting the
> > +		 * write of the CPU, the DMA is in charge of updating
> > +		 * the index of used ring.
> > +		 */
>
> According to the comments, this should be a DMA data move, but the
> following code is a CPU data move. Is anything wrong here?

The used index is updated by the CPU if there are no in-flight IOAT
copies; otherwise, it is updated by the IOAT. The code in "else {}" is
executed only when dma_vr->nr_inflight is 0, which means there are no
in-flight IOAT copies, so the CPU is in charge of updating the used
ring's index.

>
> > +		*(volatile uint16_t *)&dma_vr->vr.used->idx +=
> > +			dma_vr->shadow_used_idx;
> > +		dma_vr->copy_done_used += dma_vr->shadow_used_idx;
> > +	}
> > +
> > +	dma_vr->shadow_used_idx = 0;
> > +}
> > +
> > 2.7.4
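For reference, the "dmas" device argument parsed by open_dma() above
binds a txq to an I/OAT device by its PCI address. Based on that
parsing code, an invocation would look something like this (the vdev
name, socket path, and PCI addresses are purely illustrative):

	--vdev 'net_vhost0,iface=/tmp/vhost-net0,dmas=[txq0@00:04.0;txq1@00:04.1]'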