From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Liu, Yong"
To: "Hu, Jiayu", "dev@dpdk.org"
CC: "maxime.coquelin@redhat.com", "Ye, Xiaolong", "Wang, Zhihong"
Date: Wed, 18 Mar 2020 01:22:39 +0000
Subject: Re: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
 accelerate Tx operations

> -----Original Message-----
> From: Hu, Jiayu
> Sent: Tuesday, March 17, 2020 5:31 PM
> To: Liu, Yong; dev@dpdk.org
> Cc: maxime.coquelin@redhat.com; Ye, Xiaolong; Wang, Zhihong
> Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> accelerate Tx operations
>
> Hi Marvin,
>
> Thanks for the comments. Replies are inline.
>
> > -----Original Message-----
> > From: Liu, Yong
> > Sent: Tuesday, March 17, 2020 3:21 PM
> > To: Hu, Jiayu; dev@dpdk.org
> > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong; Wang, Zhihong; Hu, Jiayu
> > Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > accelerate Tx operations
> >
> > Hi Jiayu,
> > Some comments are inline.
> >
> > Thanks,
> > Marvin
> >
> > > -----Original Message-----
> > > From: dev On Behalf Of Jiayu Hu
> > > Sent: Tuesday, March 17, 2020 5:21 PM
> > > To: dev@dpdk.org
> > > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong; Wang, Zhihong; Hu, Jiayu
> > > Subject: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > > accelerate Tx operations
> > >
> > >
> > >  int vhost_logtype;
> > > @@ -30,8 +34,12 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > >  #define ETH_VHOST_IOMMU_SUPPORT "iommu-support"
> > >  #define ETH_VHOST_POSTCOPY_SUPPORT "postcopy-support"
> > >  #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
> > > +#define ETH_VHOST_DMA_ARG "dmas"
> > >  #define VHOST_MAX_PKT_BURST 32
> > >
> > > +/* ring size of I/OAT */
> > > +#define IOAT_RING_SIZE 1024
> > > +
> >
> > Jiayu,
> > The configured I/OAT ring size is 1024 here, but I do not see an
> > in_flight or nr_batching size check in the enqueue function.
> > Is there any possibility that the IOAT ring gets exhausted?
>
> We will wait for the IOAT's copy completion when its ring is full.
> This guarantees that all enqueues to the IOAT can succeed.
>
> > > +struct dma_info_input {
> > > +	struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2];
> > > +	uint16_t nr;
> > > +};
> > > +
> > > +static inline int
> > > +open_dma(const char *key __rte_unused, const char *value,
> > > +	 void *extra_args)
> > > +{
> > > +	struct dma_info_input *dma_info = extra_args;
> > > +	char *input = strndup(value, strlen(value) + 1);
> > > +	char *addrs = input;
> > > +	char *ptrs[2];
> > > +	char *start, *end, *substr;
> > > +	int64_t qid, vring_id;
> > > +	struct rte_ioat_rawdev_config config;
> > > +	struct rte_rawdev_info info = { .dev_private = &config };
> > > +	char name[32];
> > > +	int dev_id;
> > > +	int ret = 0;
> > > +
> > > +	while (isblank(*addrs))
> > > +		addrs++;
> > > +	if (addrs == '\0') {
> > > +		VHOST_LOG(ERR, "No input DMA addresses\n");
> > > +		ret = -1;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/* process DMA devices within bracket. */
> > > +	addrs++;
> > > +	substr = strtok(addrs, ";]");
> > > +	if (!substr) {
> > > +		VHOST_LOG(ERR, "No input DMA addresses\n");
> > > +		ret = -1;
> > > +		goto out;
> > > +	}
> > > +
> > > +	do {
> > > +		rte_strsplit(substr, strlen(substr), ptrs, 2, '@');
> > > +
> >
> > rte_strsplit() can fail; its return value needs to be checked.
>
> Thanks. Will check it later.
>
> >
> > > +		start = strstr(ptrs[0], "txq");
> > > +		if (start == NULL) {
> > > +			VHOST_LOG(ERR, "Illegal queue\n");
> > > +			ret = -1;
> > > +			goto out;
> > > +		}
> > > +
> > > +		start += 3;
> >
> > It's better not to use a hardcoded value here.
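For example, deriving the offset from the string literal itself avoids
the magic number. An untested sketch:

	#define TXQ_PREFIX	"txq"

	start = strstr(ptrs[0], TXQ_PREFIX);
	if (start == NULL) {
		VHOST_LOG(ERR, "Illegal queue\n");
		ret = -1;
		goto out;
	}
	/* skip the prefix without a hardcoded 3 */
	start += sizeof(TXQ_PREFIX) - 1;
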
> > > +		qid = strtol(start, &end, 0);
> > > +		if (end == start) {
> > > +			VHOST_LOG(ERR, "No input queue ID\n");
> > > +			ret = -1;
> > > +			goto out;
> > > +		}
> > > +
> > > +		vring_id = qid * 2 + VIRTIO_RXQ;
> > > +		if (rte_pci_addr_parse(ptrs[1],
> > > +				       &dma_info->dmas[vring_id].addr) < 0) {
> > > +			VHOST_LOG(ERR, "Invalid DMA address %s\n", ptrs[1]);
> > > +			ret = -1;
> > > +			goto out;
> > > +		}
> > > +
> > > +		rte_pci_device_name(&dma_info->dmas[vring_id].addr,
> > > +				    name, sizeof(name));
> > > +		dev_id = rte_rawdev_get_dev_id(name);
> > > +		if (dev_id == (uint16_t)(-ENODEV) ||
> > > +		    dev_id == (uint16_t)(-EINVAL)) {
> > > +			VHOST_LOG(ERR, "Cannot find device %s.\n", name);
> > > +			ret = -1;
> > > +			goto out;
> > > +		}
> > > +
> >
> > Multiple queues can't share one IOAT device; a check should be added
> > here, as that is not allowed.
>
> I just claim it in the doc. Will add the check later.
>
> > > +
> > > +/* notify front-end of enqueued packets */
> > > +static __rte_always_inline void
> > > +vhost_dma_vring_call(struct pmd_internal *dev,
> > > +		     struct dma_vring *dma_vr)
> > > +{
> > > +	vhost_vring_call_split(dev, dma_vr);
> > > +}
> > > +
> > > +int
> > > +free_dma_done(void *dev, void *dma_vr)
> > > +{
> > > +	uintptr_t flags[255], tmps[255];
> >
> > Please add a meaningful macro for 255; not sure why the limit is 255
> > rather than 256.
>
> The second parameter of rte_ioat_completed_copies() is uint8_t, so the
> max value can only be 255. I can replace it with UINT8_MAX later.
>
> >
> > > +	int dma_done, i;
> > > +	uint16_t used_idx;
> > > +	struct pmd_internal *device = dev;
> > > +	struct dma_vring *dma_vring = dma_vr;
> > > +
> > > +	dma_done = rte_ioat_completed_copies(dma_vring->dev_id, 255,
> > > +					     flags, tmps);
> > > +	if (unlikely(dma_done <= 0))
> > > +		return dma_done;
> > > +
> > > +	dma_vring->nr_inflight -= dma_done;
> >
> > I'm not sure whether the DMA engine returns completions in input
> > order; the mbuf free should happen after the index update is done.
>
> IMO, a pktmbuf can be freed once the IOAT no longer occupies it.
> We don't need to wait for the update of the used index.
>
> This is achieved by using the mbuf's refcnt. We increase the mbuf's
> refcnt by 1 once we submit its copy job to the IOAT. If it has N
> copies that are all offloaded to the IOAT, the refcnt will be
> increased by N. On completion of an IOAT copy, if it's a pktmbuf copy,
> we decrease its refcnt by rte_pktmbuf_free_seg(). When the value of
> refcnt reaches 1, which means all its copies are completed, the mbuf
> will be freed.
>

Thanks for the reply.
My concern is whether the IOAT preserves completion order with respect
to the input sequence. If the hardware can guarantee that, no more
questions here.
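For reference, my reading of the refcnt scheme described above, as a
minimal sketch (submit_pkt_copy() and on_copy_done() are illustrative
stand-ins, not functions from the patch):

	#include <rte_mbuf.h>

	/* sketch: take one extra reference per copy job offloaded to
	 * the IOAT, so the segment stays valid until the copy is done */
	static void
	submit_pkt_copy(struct rte_mbuf *pkt)
	{
		rte_mbuf_refcnt_update(pkt, 1);
		/* ... enqueue the copy to the IOAT, passing pkt back
		 * as the completion cookie ... */
	}

	/* sketch: on completion, drop one reference; the segment is
	 * actually freed once the last reference is gone */
	static void
	on_copy_done(struct rte_mbuf *pkt)
	{
		rte_pktmbuf_free_seg(pkt);
	}
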
> > > +	for (i = 0; i < dma_done; i++) {
> > > +		if ((uint64_t)flags[i] >= dma_vring->max_indices) {
> > > +			struct rte_mbuf *pkt = (struct rte_mbuf *)flags[i];
> > > +
> > > +			/**
> > > +			 * the DMA completes a packet copy job, we
> > > +			 * decrease the refcnt or free the mbuf segment.
> > > +			 */
> > > +			rte_pktmbuf_free_seg(pkt);
> > > +		} else {
> > > +			uint16_t id = flags[i];
> > > +
> > > +			/**
> > > +			 * the DMA completes updating index of the
> > > +			 * used ring.
> > > +			 */
> > > +			used_idx = dma_vring->indices[id].data;
> > > +			VHOST_LOG(DEBUG, "The DMA finishes updating index %u "
> > > +				  "for the used ring.\n", used_idx);
> > > +
> > > +			dma_vring->copy_done_used = used_idx;
> > > +			vhost_dma_vring_call(device, dma_vring);
> > > +			put_used_index(dma_vring->indices,
> > > +				       dma_vring->max_indices, id);
> > > +		}
> > > +	}
> > > +	return dma_done;
> > > +}
> > > +
> > > +static __rte_always_inline bool
> > > +rxvq_is_mergeable(struct pmd_internal *dev)
> > > +{
> > > +	return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
> > > +}
> > > +
> >
> > I'm not sure whether the shadow used ring helps in the DMA
> > acceleration scenario. The vhost driver will wait until the DMA copy
> > is done, so this optimization of the CPU moves may not improve
> > overall performance and just adds convoluted code.
>
> For small copies, we still use the CPU, as the IOAT is less efficient
> on them. Therefore, I think we still need the CPU optimization here.
>

Thanks for the clarification, got it.

> > > +static __rte_always_inline void
> > > +flush_shadow_used_ring_split(struct pmd_internal *dev,
> > > +			     struct dma_vring *dma_vr)
> > > +{
> > > +	uint16_t used_idx = dma_vr->last_used_idx & (dma_vr->vr.size - 1);
> > > +
> > > +	if (used_idx + dma_vr->shadow_used_idx <= dma_vr->vr.size) {
> > > +		do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > > +						dma_vr->shadow_used_idx);
> > > +	} else {
> > > +		uint16_t size;
> > > +
> > > +		/* update used ring interval [used_idx, vr->size] */
> > > +		size = dma_vr->vr.size - used_idx;
> > > +		do_flush_shadow_used_ring_split(dma_vr, used_idx, 0, size);
> > > +
> > > +		/* update the left half used ring interval [0, left_size] */
> > > +		do_flush_shadow_used_ring_split(dma_vr, 0, size,
> > > +						dma_vr->shadow_used_idx -
> > > +						size);
> > > +	}
> > > +	dma_vr->last_used_idx += dma_vr->shadow_used_idx;
> > > +
> > > +	rte_smp_wmb();
> > > +
> > > +	if (dma_vr->nr_inflight > 0) {
> > > +		struct ring_index *index;
> > > +
> > > +		index = get_empty_index(dma_vr->indices,
> > > +					dma_vr->max_indices);
> > > +		index->data = dma_vr->last_used_idx;
> > > +		while (unlikely(rte_ioat_enqueue_copy(dma_vr->dev_id,
> > > +						      index->pa,
> > > +						      dma_vr->used_idx_hpa,
> > > +						      sizeof(uint16_t),
> > > +						      index->idx, 0, 0) ==
> > > +				0)) {
> > > +			int ret;
> > > +
> > > +			do {
> > > +				ret = dma_vr->dma_done_fn(dev, dma_vr);
> > > +			} while (ret <= 0);
> > > +		}
> > > +		dma_vr->nr_batching++;
> > > +		dma_vr->nr_inflight++;
> > > +	} else {
> > > +		/**
> > > +		 * we update index of used ring when all previous copy
> > > +		 * jobs are completed.
> > > +		 *
> > > +		 * When enabling DMA copy, if there are outstanding copy
> > > +		 * jobs of the DMA, to avoid the DMA overwriting the
> > > +		 * write of the CPU, the DMA is in charge of updating
> > > +		 * the index of used ring.
> > > +		 */
> >
> > According to the comments, this should be a DMA data move, but the
> > following code is a CPU data move. Is anything wrong here?
>
> The update of the used index is done by the CPU if there are no
> inflight IOAT copies; otherwise, it's done by the IOAT. The code in
> "else {}" is executed only when dma_vr->nr_inflight is 0, which means
> there are no inflight IOAT copies, so the CPU is in charge of updating
> the used ring's index.
>
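Thanks, that makes it clearer. So the dispatch is effectively the
following (my simplified reading, not the exact patch code;
enqueue_used_idx_copy() stands in for the IOAT enqueue of the index
copy shown above):

	if (dma_vr->nr_inflight > 0) {
		/* outstanding IOAT copies: let the IOAT also write the
		 * used index, so an earlier CPU store cannot be
		 * overwritten by a later DMA store */
		enqueue_used_idx_copy(dma_vr);
	} else {
		/* no inflight copies: the CPU updates the index directly */
		*(volatile uint16_t *)&dma_vr->vr.used->idx +=
			dma_vr->shadow_used_idx;
	}
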
Given that, it looks like the comment can be split into two parts, with
the second part describing how the index update behaves when there are
inflight copies.

> > > +		*(volatile uint16_t *)&dma_vr->vr.used->idx +=
> > > +			dma_vr->shadow_used_idx;
> > > +		dma_vr->copy_done_used += dma_vr->shadow_used_idx;
> > > +	}
> > > +
> > > +	dma_vr->shadow_used_idx = 0;
> > > +}
> > > +
> > > 2.7.4