From: "Liu, Yong" <yong.liu@intel.com>
To: "Hu, Jiayu" <jiayu.hu@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "maxime.coquelin@redhat.com" <maxime.coquelin@redhat.com>,
"Ye, Xiaolong" <xiaolong.ye@intel.com>,
"Wang, Zhihong" <zhihong.wang@intel.com>
Subject: Re: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to accelerate Tx operations
Date: Wed, 18 Mar 2020 01:22:39 +0000 [thread overview]
Message-ID: <86228AFD5BCD8E4EBFD2B90117B5E81E6350B775@SHSMSX103.ccr.corp.intel.com> (raw)
In-Reply-To: <2484dc0f250f4ddabfecd6f7c9ed8f9b@intel.com>
> -----Original Message-----
> From: Hu, Jiayu <jiayu.hu@intel.com>
> Sent: Tuesday, March 17, 2020 5:31 PM
> To: Liu, Yong <yong.liu@intel.com>; dev@dpdk.org
> Cc: maxime.coquelin@redhat.com; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> accelerate Tx operations
>
> Hi Marvin,
>
> Thanks for the comments. Replies are inline.
>
> > -----Original Message-----
> > From: Liu, Yong <yong.liu@intel.com>
> > Sent: Tuesday, March 17, 2020 3:21 PM
> > To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong <xiaolong.ye@intel.com>;
> > Wang, Zhihong <zhihong.wang@intel.com>; Hu, Jiayu
> <jiayu.hu@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > accelerate Tx operations
> >
> > Hi Jiayu,
> > Some comments are inline.
> >
> > Thanks,
> > Marvin
> >
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Jiayu Hu
> > > Sent: Tuesday, March 17, 2020 5:21 PM
> > > To: dev@dpdk.org
> > > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong
> <xiaolong.ye@intel.com>;
> > > Wang, Zhihong <zhihong.wang@intel.com>; Hu, Jiayu
> <jiayu.hu@intel.com>
> > > Subject: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > > accelerate Tx operations
> > >
> > >
> > > int vhost_logtype;
> > > @@ -30,8 +34,12 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > > #define ETH_VHOST_IOMMU_SUPPORT "iommu-support"
> > > #define ETH_VHOST_POSTCOPY_SUPPORT "postcopy-support"
> > > #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
> > > +#define ETH_VHOST_DMA_ARG "dmas"
> > > #define VHOST_MAX_PKT_BURST 32
> > >
> > > +/* ring size of I/OAT */
> > > +#define IOAT_RING_SIZE 1024
> > > +
> >
> > Jiayu,
> > The configured I/OAT ring size is 1024 here, but I do not see an in_flight
> > or nr_batching size check in the enqueue function.
> > Is there any possibility that the IOAT ring gets exhausted?
>
> We will wait for the IOAT's copy completion when its ring is full.
> This guarantees that all enqueues to the IOAT can succeed.
>
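
OK. Just to confirm my understanding, a minimal sketch of that drain-and-retry
path (free_dma_done() is the completion handler from this patch; dev, dma_vr
and the copy parameters are assumed to be in scope):

    /* Sketch only: if the IOAT descriptor ring is full, keep harvesting
     * completions until the enqueue succeeds.
     */
    while (rte_ioat_enqueue_copy(dma_vr->dev_id, src_pa, dst_pa, len,
                                 src_hdl, 0, 0) == 0) {
            int ret;

            do {
                    ret = free_dma_done(dev, dma_vr);
            } while (ret <= 0);
    }
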
> > > +struct dma_info_input {
> > > + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2];
> > > + uint16_t nr;
> > > +};
> > > +
> > > +static inline int
> > > +open_dma(const char *key __rte_unused, const char *value, void
> > > *extra_args)
> > > +{
> > > + struct dma_info_input *dma_info = extra_args;
> > > + char *input = strndup(value, strlen(value) + 1);
> > > + char *addrs = input;
> > > + char *ptrs[2];
> > > + char *start, *end, *substr;
> > > + int64_t qid, vring_id;
> > > + struct rte_ioat_rawdev_config config;
> > > + struct rte_rawdev_info info = { .dev_private = &config };
> > > + char name[32];
> > > + int dev_id;
> > > + int ret = 0;
> > > +
> > > + while (isblank(*addrs))
> > > + addrs++;
> > > + if (addrs == '\0') {
> > > + VHOST_LOG(ERR, "No input DMA addresses\n");
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + /* process DMA devices within bracket. */
> > > + addrs++;
> > > + substr = strtok(addrs, ";]");
> > > + if (!substr) {
> > > + VHOST_LOG(ERR, "No input DMA addresse\n");
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + do {
> > > + rte_strsplit(substr, strlen(substr), ptrs, 2, '@');
> > > +
> > Function rte_strsplit() can fail. Its return value needs to be checked.
>
> Thanks. Will check it later.
>
> >
> > > + start = strstr(ptrs[0], "txq");
> > > + if (start == NULL) {
> > > + VHOST_LOG(ERR, "Illegal queue\n");
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + start += 3;
> >
> > It's better not to use a hardcoded value here.
> >
> > > + qid = strtol(start, &end, 0);
> > > + if (end == start) {
> > > + VHOST_LOG(ERR, "No input queue ID\n");
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + vring_id = qid * 2 + VIRTIO_RXQ;
> > > + if (rte_pci_addr_parse(ptrs[1],
> > > + &dma_info->dmas[vring_id].addr) < 0) {
> > > + VHOST_LOG(ERR, "Invalid DMA address %s\n",
> > > ptrs[1]);
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + rte_pci_device_name(&dma_info->dmas[vring_id].addr,
> > > + name, sizeof(name));
> > > + dev_id = rte_rawdev_get_dev_id(name);
> > > + if (dev_id == (uint16_t)(-ENODEV) ||
> > > + dev_id == (uint16_t)(-EINVAL)) {
> > > + VHOST_LOG(ERR, "Cannot find device %s.\n", name);
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > Multiple queues can't share one IOAT device. A check should be added here,
> > since that is not allowed.
>
> For now it is only stated in the doc. Will add the check later.
>
> > > +
> > > +/* notify front-end of enqueued packets */
> > > +static __rte_always_inline void
> > > +vhost_dma_vring_call(struct pmd_internal *dev, struct dma_vring
> > > *dma_vr)
> > > +{
> > > + vhost_vring_call_split(dev, dma_vr);
> > > +}
> > > +
> > > +int
> > > +free_dma_done(void *dev, void *dma_vr)
> > > +{
> > > + uintptr_t flags[255], tmps[255];
> >
> > Please add a meaningful macro for 255; I am not sure why the limit is 255
> > rather than 256.
>
> The second parameter of rte_ioat_completed_copies() is uint8_t, so the max
> value can only be 255. I can replace it with UINT8_MAX later.
>
> >
> > > + int dma_done, i;
> > > + uint16_t used_idx;
> > > + struct pmd_internal *device = dev;
> > > + struct dma_vring *dma_vring = dma_vr;
> > > +
> > > + dma_done = rte_ioat_completed_copies(dma_vring->dev_id, 255,
> > > flags,
> > > + tmps);
> > > + if (unlikely(dma_done <= 0))
> > > + return dma_done;
> > > +
> > > + dma_vring->nr_inflight -= dma_done;
> >
> > Not sure whether the DMA engine returns completions in the same order as
> > the input sequence; the mbuf free should happen after the index update is
> > done.
>
> IMO, the pktmbuf can be freed once the IOAT no longer occupies it.
> We don't need to wait for the update of the used index.
>
> This is achieved by using the mbuf's refcnt. We increase the mbuf's refcnt
> by 1 once we submit its copy job to the IOAT. If it has N copies that are
> all offloaded to the IOAT, the refcnt is increased by N. On completion of an
> IOAT copy, if it's a pktmbuf copy, we decrease its refcnt via
> rte_pktmbuf_free_seg(). When the refcnt reaches 1, which means all its
> copies are completed, the mbuf will be freed.
>
Thanks for the reply.
My concern is whether the IOAT preserves completion order relative to the
input sequence. If the hardware can guarantee that, I have no further
questions here.
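
On the refcnt mechanism itself, here is a minimal sketch of how I read it
(simplified from the patch; the index-update handles and doorbell are left
out, and the function names other than the rte_* ones are made up):

    #include <stdint.h>
    #include <rte_mbuf.h>
    #include <rte_ioat_rawdev.h>

    /* Sketch only: take an extra reference while the IOAT owns the segment. */
    static int
    submit_seg_copy(int dev_id, struct rte_mbuf *pkt,
                    phys_addr_t src_pa, phys_addr_t dst_pa, unsigned int len)
    {
            rte_mbuf_refcnt_update(pkt, 1);
            if (rte_ioat_enqueue_copy(dev_id, src_pa, dst_pa, len,
                                      (uintptr_t)pkt, 0, 0) == 0) {
                    /* ring full: drop the extra reference, caller falls back */
                    rte_mbuf_refcnt_update(pkt, -1);
                    return -1;
            }
            return 0;
    }

    /* Sketch only: each completion drops one reference; a segment is freed
     * only after every offloaded copy of it has completed.
     */
    static void
    drain_seg_copies(int dev_id)
    {
            uintptr_t src_hdls[UINT8_MAX], dst_hdls[UINT8_MAX];
            int n, i;

            n = rte_ioat_completed_copies(dev_id, UINT8_MAX, src_hdls,
                                          dst_hdls);
            for (i = 0; i < n; i++)
                    rte_pktmbuf_free_seg((struct rte_mbuf *)src_hdls[i]);
    }
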
> >
> > > + for (i = 0; i < dma_done; i++) {
> > > + if ((uint64_t)flags[i] >= dma_vring->max_indices) {
> > > + struct rte_mbuf *pkt = (struct rte_mbuf *)flags[i];
> > > +
> > > + /**
> > > + * the DMA completes a packet copy job, we
> > > + * decrease the refcnt or free the mbuf segment.
> > > + */
> > > + rte_pktmbuf_free_seg(pkt);
> > > + } else {
> > > + uint16_t id = flags[i];
> > > +
> > > + /**
> > > + * the DMA completes updating index of the
> > > + * used ring.
> > > + */
> > > + used_idx = dma_vring->indices[id].data;
> > > + VHOST_LOG(DEBUG, "The DMA finishes updating
> > > index %u "
> > > + "for the used ring.\n", used_idx);
> > > +
> > > + dma_vring->copy_done_used = used_idx;
> > > + vhost_dma_vring_call(device, dma_vring);
> > > + put_used_index(dma_vring->indices,
> > > + dma_vring->max_indices, id);
> > > + }
> > > + }
> > > + return dma_done;
> > > +}
> > > +
> > > +static __rte_always_inline bool
> > > +rxvq_is_mergeable(struct pmd_internal *dev)
> > > +{
> > > + return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
> > > +}
> > > +
> >
> > I'm not sure whether the shadow used ring can help in the DMA
> > acceleration scenario.
> > The vhost driver will wait until the DMA copy is done. Optimizing the CPU
> > moves may not help overall performance and just adds odd-looking code.
>
> For small copies, we still use the CPU, as the IOAT is less efficient for
> small copies.
> Therefore, I think we still need the CPU optimization here.
>
Thanks for the clarification, got it.
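
Just for my own notes, a rough sketch of the size-based dispatch you describe
(the threshold name and value are made up for illustration, and the doorbell
and completion handling are omitted):

    #include <rte_memcpy.h>
    #include <rte_ioat_rawdev.h>

    #define SMALL_COPY_THRESHOLD 1024    /* hypothetical cut-off */

    /* Sketch only: copies below the threshold stay on the CPU, larger ones
     * are offloaded to the IOAT.
     */
    static inline void
    copy_desc_payload(int dev_id, void *dst, const void *src,
                      phys_addr_t dst_pa, phys_addr_t src_pa,
                      unsigned int len, uintptr_t hdl)
    {
            if (len < SMALL_COPY_THRESHOLD) {
                    rte_memcpy(dst, src, len);      /* CPU copy */
            } else {
                    while (rte_ioat_enqueue_copy(dev_id, src_pa, dst_pa, len,
                                                 hdl, 0, 0) == 0)
                            ;  /* ring full: real code drains completions */
            }
    }
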
> > > +static __rte_always_inline void
> > > +flush_shadow_used_ring_split(struct pmd_internal *dev,
> > > + struct dma_vring *dma_vr)
> > > +{
> > > + uint16_t used_idx = dma_vr->last_used_idx & (dma_vr->vr.size - 1);
> > > +
> > > + if (used_idx + dma_vr->shadow_used_idx <= dma_vr->vr.size) {
> > > + do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > > + dma_vr->shadow_used_idx);
> > > + } else {
> > > + uint16_t size;
> > > +
> > > + /* update used ring interval [used_idx, vr->size] */
> > > + size = dma_vr->vr.size - used_idx;
> > > + do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > size);
> > > +
> > > + /* update the left half used ring interval [0, left_size] */
> > > + do_flush_shadow_used_ring_split(dma_vr, 0, size,
> > > + dma_vr->shadow_used_idx -
> > > + size);
> > > + }
> > > + dma_vr->last_used_idx += dma_vr->shadow_used_idx;
> > > +
> > > + rte_smp_wmb();
> > > +
> > > + if (dma_vr->nr_inflight > 0) {
> > > + struct ring_index *index;
> > > +
> > > + index = get_empty_index(dma_vr->indices, dma_vr-
> > > >max_indices);
> > > + index->data = dma_vr->last_used_idx;
> > > + while (unlikely(rte_ioat_enqueue_copy(dma_vr->dev_id,
> > > + index->pa,
> > > + dma_vr->used_idx_hpa,
> > > + sizeof(uint16_t),
> > > + index->idx, 0, 0) ==
> > > + 0)) {
> > > + int ret;
> > > +
> > > + do {
> > > + ret = dma_vr->dma_done_fn(dev, dma_vr);
> > > + } while (ret <= 0);
> > > + }
> > > + dma_vr->nr_batching++;
> > > + dma_vr->nr_inflight++;
> > > + } else {
> > > + /**
> > > + * we update index of used ring when all previous copy
> > > + * jobs are completed.
> > > + *
> > > + * When enabling DMA copy, if there are outstanding copy
> > > + * jobs of the DMA, to avoid the DMA overwriting the
> > > + * write of the CPU, the DMA is in charge of updating
> > > + * the index of used ring.
> > > + */
> >
> > According to the comments, this should be a DMA data move, but the
> > following code is a CPU data move. Is anything wrong here?
>
> The update of the used index is done by the CPU if there are no inflight
> IOAT copies; otherwise, it's done by the IOAT. The code in "else {}" is
> executed only when dma_vr->nr_inflight is 0, which means there are no
> inflight IOAT copies, so the CPU is in charge of updating the used ring's
> index.
>
So it looks like the comment can be separated into two parts; the second part
should describe how the index update is handled when there are inflight
copies.
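
For example, the else-branch part could read:

    /* No in-flight DMA copies: the CPU writes the used ring's index
     * directly. (If there were outstanding DMA copy jobs, the DMA engine
     * would update the index instead, to avoid the DMA overwriting the
     * CPU's write.)
     */
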
> >
> > > + *(volatile uint16_t *)&dma_vr->vr.used->idx +=
> > > + dma_vr->shadow_used_idx;
> > > + dma_vr->copy_done_used += dma_vr->shadow_used_idx;
> > > + }
> > > +
> > > + dma_vr->shadow_used_idx = 0;
> > > +}
> > > +
> > > 2.7.4