From: "Liu, Yong"
To: "Hu, Jiayu" , "dev@dpdk.org"
CC: "maxime.coquelin@redhat.com" , "Ye, Xiaolong" , "Wang, Zhihong" , "Hu, Jiayu"
Subject: Re: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to accelerate Tx operations
Date: Tue, 17 Mar 2020 07:21:00 +0000
Message-ID: <86228AFD5BCD8E4EBFD2B90117B5E81E6350B05E@SHSMSX103.ccr.corp.intel.com>
References: <1584436885-18651-1-git-send-email-jiayu.hu@intel.com> <1584436885-18651-4-git-send-email-jiayu.hu@intel.com>
In-Reply-To: <1584436885-18651-4-git-send-email-jiayu.hu@intel.com>

Hi Jiayu,
Some comments are inline.

Thanks,
Marvin

> -----Original Message-----
> From: dev On Behalf Of Jiayu Hu
> Sent: Tuesday, March 17, 2020 5:21 PM
> To: dev@dpdk.org
> Cc: maxime.coquelin@redhat.com; Ye, Xiaolong ; Wang, Zhihong ; Hu, Jiayu
> Subject: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> accelerate Tx operations
>
> This patch accelerates large data movement in Tx operations via DMA
> engines, like I/OAT, the DMA engine in Intel's processors.
>
> Large copies are offloaded from the CPU to the DMA engine in an
> asynchronous manner. 
The CPU just submits copy jobs to the DMA engine > and without waiting for DMA copy completion; there is no CPU intervention > during DMA data transfer. By overlapping CPU computation and DMA copy, > we can save precious CPU cycles and improve the overall throughput for > vhost-user PMD based applications, like OVS. Due to startup overheads > associated with DMA engines, small copies are performed by the CPU. >=20 > Note that vhost-user PMD can support various DMA engines, but it just > supports I/OAT devices currently. In addition, I/OAT acceleration > is only enabled for split rings. >=20 > DMA devices used by queues are assigned by users; for a queue without > assigning a DMA device, the PMD will leverages librte_vhost to perform > Tx operations. A queue can only be assigned one I/OAT device, and > an I/OAT device can only be used by one queue. >=20 > We introduce a new vdev parameter to enable DMA acceleration for Tx > operations of queues: > - dmas: This parameter is used to specify the assigned DMA device of > a queue. > Here is an example: > $ ./testpmd -c f -n 4 \ > --vdev 'net_vhost0,iface=3D/tmp/s0,queues=3D1,dmas=3D[txq0@00:04.0]' >=20 > Signed-off-by: Jiayu Hu > --- > drivers/net/vhost/Makefile | 2 +- > drivers/net/vhost/internal.h | 19 + > drivers/net/vhost/meson.build | 2 +- > drivers/net/vhost/rte_eth_vhost.c | 252 ++++++++++++- > drivers/net/vhost/virtio_net.c | 742 > ++++++++++++++++++++++++++++++++++++++ > drivers/net/vhost/virtio_net.h | 120 ++++++ > 6 files changed, 1120 insertions(+), 17 deletions(-) >=20 > diff --git a/drivers/net/vhost/Makefile b/drivers/net/vhost/Makefile > index 19cae52..87dfb14 100644 > --- a/drivers/net/vhost/Makefile > +++ b/drivers/net/vhost/Makefile > @@ -11,7 +11,7 @@ LIB =3D librte_pmd_vhost.a > LDLIBS +=3D -lpthread > LDLIBS +=3D -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring > LDLIBS +=3D -lrte_ethdev -lrte_net -lrte_kvargs -lrte_vhost > -LDLIBS +=3D -lrte_bus_vdev > +LDLIBS +=3D -lrte_bus_vdev -lrte_rawdev_ioat >=20 > CFLAGS +=3D -O3 > CFLAGS +=3D $(WERROR_FLAGS) > diff --git a/drivers/net/vhost/internal.h b/drivers/net/vhost/internal.h > index 7588fdf..f19ed7a 100644 > --- a/drivers/net/vhost/internal.h > +++ b/drivers/net/vhost/internal.h > @@ -20,6 +20,8 @@ extern int vhost_logtype; > #define VHOST_LOG(level, ...) \ > rte_log(RTE_LOG_ ## level, vhost_logtype, __VA_ARGS__) >=20 > +typedef int (*process_dma_done_fn)(void *dev, void *dma_vr); > + > enum vhost_xstats_pkts { > VHOST_UNDERSIZE_PKT =3D 0, > VHOST_64_PKT, > @@ -96,6 +98,11 @@ struct dma_vring { > * used by the DMA. 
> */ > phys_addr_t used_idx_hpa; > + > + struct ring_index *indices; > + uint16_t max_indices; > + > + process_dma_done_fn dma_done_fn; > }; >=20 > struct vhost_queue { > @@ -110,6 +117,13 @@ struct vhost_queue { > struct dma_vring *dma_vring; > }; >=20 > +struct dma_info { > + process_dma_done_fn dma_done_fn; > + struct rte_pci_addr addr; > + uint16_t dev_id; > + bool is_valid; > +}; > + > struct pmd_internal { > rte_atomic32_t dev_attached; > char *iface_name; > @@ -132,6 +146,11 @@ struct pmd_internal { > /* negotiated features */ > uint64_t features; > size_t hdr_len; > + bool vring_setup_done; > + bool guest_mem_populated; > + > + /* User-assigned DMA information */ > + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; > }; >=20 > #ifdef __cplusplus > diff --git a/drivers/net/vhost/meson.build b/drivers/net/vhost/meson.buil= d > index b308dcb..af3c640 100644 > --- a/drivers/net/vhost/meson.build > +++ b/drivers/net/vhost/meson.build > @@ -6,4 +6,4 @@ reason =3D 'missing dependency, DPDK vhost library' > sources =3D files('rte_eth_vhost.c', > 'virtio_net.c') > install_headers('rte_eth_vhost.h') > -deps +=3D 'vhost' > +deps +=3D ['vhost', 'rawdev'] > diff --git a/drivers/net/vhost/rte_eth_vhost.c > b/drivers/net/vhost/rte_eth_vhost.c > index b5c927c..9faaa02 100644 > --- a/drivers/net/vhost/rte_eth_vhost.c > +++ b/drivers/net/vhost/rte_eth_vhost.c > @@ -15,8 +15,12 @@ > #include > #include > #include > +#include > +#include > +#include >=20 > #include "internal.h" > +#include "virtio_net.h" > #include "rte_eth_vhost.h" >=20 > int vhost_logtype; > @@ -30,8 +34,12 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM}; > #define ETH_VHOST_IOMMU_SUPPORT "iommu-support" > #define ETH_VHOST_POSTCOPY_SUPPORT "postcopy-support" > #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso" > +#define ETH_VHOST_DMA_ARG "dmas" > #define VHOST_MAX_PKT_BURST 32 >=20 > +/* ring size of I/OAT */ > +#define IOAT_RING_SIZE 1024 > + Jiayu, Configured I/OAT ring size is 1024 here, but do not see in_flight or nr_bat= ching size check in enqueue function. Is there any possibility that IOAT ring exhausted? 
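Something like the sketch below is what I have in mind (only a rough idea, reusing the nr_inflight/nr_batching counters and free_dma_done() from this patch; the exact headroom value is up to you), placed before each rte_ioat_enqueue_copy() call:

	/* keep outstanding copy jobs below the I/OAT ring size; otherwise
	 * kick the batched jobs and reap completions to free ring slots */
	while (unlikely(dma_vr->nr_inflight >= IOAT_RING_SIZE - 1)) {
		if (dma_vr->nr_batching) {
			rte_ioat_do_copies(dma_vr->dev_id);
			dma_vr->nr_batching = 0;
		}
		free_dma_done(dev, dma_vr);
	}
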
> static const char *valid_arguments[] =3D { > ETH_VHOST_IFACE_ARG, > ETH_VHOST_QUEUES_ARG, > @@ -40,6 +48,7 @@ static const char *valid_arguments[] =3D { > ETH_VHOST_IOMMU_SUPPORT, > ETH_VHOST_POSTCOPY_SUPPORT, > ETH_VHOST_VIRTIO_NET_F_HOST_TSO, > + ETH_VHOST_DMA_ARG, > NULL > }; >=20 > @@ -377,6 +386,7 @@ static uint16_t > eth_vhost_tx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs) > { > struct vhost_queue *r =3D q; > + struct pmd_internal *dev =3D r->internal; > uint16_t i, nb_tx =3D 0; > uint16_t nb_send =3D 0; >=20 > @@ -405,18 +415,33 @@ eth_vhost_tx(void *q, struct rte_mbuf **bufs, > uint16_t nb_bufs) > } >=20 > /* Enqueue packets to guest RX queue */ > - while (nb_send) { > - uint16_t nb_pkts; > - uint16_t num =3D (uint16_t)RTE_MIN(nb_send, > - VHOST_MAX_PKT_BURST); > - > - nb_pkts =3D rte_vhost_enqueue_burst(r->vid, r->virtqueue_id, > - &bufs[nb_tx], num); > - > - nb_tx +=3D nb_pkts; > - nb_send -=3D nb_pkts; > - if (nb_pkts < num) > - break; > + if (!r->dma_vring->dma_enabled) { > + while (nb_send) { > + uint16_t nb_pkts; > + uint16_t num =3D (uint16_t)RTE_MIN(nb_send, > + VHOST_MAX_PKT_BURST); > + > + nb_pkts =3D rte_vhost_enqueue_burst(r->vid, > + r->virtqueue_id, > + &bufs[nb_tx], num); > + nb_tx +=3D nb_pkts; > + nb_send -=3D nb_pkts; > + if (nb_pkts < num) > + break; > + } > + } else { > + while (nb_send) { > + uint16_t nb_pkts; > + uint16_t num =3D (uint16_t)RTE_MIN(nb_send, > + > VHOST_MAX_PKT_BURST); > + > + nb_pkts =3D vhost_dma_enqueue_burst(dev, r- > >dma_vring, > + &bufs[nb_tx], num); > + nb_tx +=3D nb_pkts; > + nb_send -=3D nb_pkts; > + if (nb_pkts < num) > + break; > + } > } >=20 > r->stats.pkts +=3D nb_tx; > @@ -434,6 +459,7 @@ eth_vhost_tx(void *q, struct rte_mbuf **bufs, > uint16_t nb_bufs) > for (i =3D nb_tx; i < nb_bufs; i++) > vhost_count_multicast_broadcast(r, bufs[i]); >=20 > + /* Only DMA non-occupied mbuf segments will be freed */ > for (i =3D 0; likely(i < nb_tx); i++) > rte_pktmbuf_free(bufs[i]); > out: > @@ -483,6 +509,12 @@ eth_rxq_intr_enable(struct rte_eth_dev *dev, > uint16_t qid) > return -1; > } >=20 > + if (vq->dma_vring->dma_enabled) { > + VHOST_LOG(INFO, "Don't support interrupt when DMA " > + "acceleration is enabled\n"); > + return -1; > + } > + > ret =3D rte_vhost_get_vhost_vring(vq->vid, (qid << 1) + 1, &vring); > if (ret < 0) { > VHOST_LOG(ERR, "Failed to get rxq%d's vring\n", qid); > @@ -508,6 +540,12 @@ eth_rxq_intr_disable(struct rte_eth_dev *dev, > uint16_t qid) > return -1; > } >=20 > + if (vq->dma_vring->dma_enabled) { > + VHOST_LOG(INFO, "Don't support interrupt when DMA " > + "acceleration is enabled\n"); > + return -1; > + } > + > ret =3D rte_vhost_get_vhost_vring(vq->vid, (qid << 1) + 1, &vring); > if (ret < 0) { > VHOST_LOG(ERR, "Failed to get rxq%d's vring", qid); > @@ -692,6 +730,13 @@ new_device(int vid) > #endif >=20 > internal->vid =3D vid; > + if (internal->guest_mem_populated && > vhost_dma_setup(internal) >=3D 0) > + internal->vring_setup_done =3D true; > + else { > + VHOST_LOG(INFO, "Not setup vrings for DMA > acceleration.\n"); > + internal->vring_setup_done =3D false; > + } > + > if (rte_atomic32_read(&internal->started) =3D=3D 1) { > queue_setup(eth_dev, internal); >=20 > @@ -747,6 +792,11 @@ destroy_device(int vid) > update_queuing_status(eth_dev); >=20 > eth_dev->data->dev_link.link_status =3D ETH_LINK_DOWN; > + /** > + * before destroy guest's vrings, I/O threads have > + * to stop accessing queues. 
> + */ > + vhost_dma_remove(internal); >=20 > if (eth_dev->data->rx_queues && eth_dev->data->tx_queues) { > for (i =3D 0; i < eth_dev->data->nb_rx_queues; i++) { > @@ -785,6 +835,11 @@ vring_state_changed(int vid, uint16_t vring, int > enable) > struct rte_eth_dev *eth_dev; > struct internal_list *list; > char ifname[PATH_MAX]; > + struct pmd_internal *dev; > + struct dma_vring *dma_vr; > + struct rte_ioat_rawdev_config config; > + struct rte_rawdev_info info =3D { .dev_private =3D &config }; > + char name[32]; >=20 > rte_vhost_get_ifname(vid, ifname, sizeof(ifname)); > list =3D find_internal_resource(ifname); > @@ -794,6 +849,53 @@ vring_state_changed(int vid, uint16_t vring, int > enable) > } >=20 > eth_dev =3D list->eth_dev; > + dev =3D eth_dev->data->dev_private; > + > + /* if fail to set up vrings, return. */ > + if (!dev->vring_setup_done) > + goto out; > + > + /* DMA acceleration just supports split rings. */ > + if (vhost_dma_vring_is_packed(dev)) { > + VHOST_LOG(INFO, "DMA acceleration just supports split " > + "rings.\n"); > + goto out; > + } > + > + /* if the vring was not given a DMA device, return. */ > + if (!dev->dmas[vring].is_valid) > + goto out; > + > + /** > + * a vring can only use one DMA device. If it has been > + * assigned one, return. > + */ > + dma_vr =3D &dev->dma_vrings[vring]; > + if (dma_vr->dma_enabled) > + goto out; > + > + rte_pci_device_name(&dev->dmas[vring].addr, name, sizeof(name)); > + rte_rawdev_info_get(dev->dmas[vring].dev_id, &info); > + config.ring_size =3D IOAT_RING_SIZE; > + if (rte_rawdev_configure(dev->dmas[vring].dev_id, &info) < 0) { > + VHOST_LOG(ERR, "Config the DMA device %s failed\n", > name); > + goto out; > + } > + > + rte_rawdev_start(dev->dmas[vring].dev_id); > + > + memcpy(&dma_vr->dma_addr, &dev->dmas[vring].addr, > + sizeof(struct rte_pci_addr)); > + dma_vr->dev_id =3D dev->dmas[vring].dev_id; > + dma_vr->dma_enabled =3D true; > + dma_vr->nr_inflight =3D 0; > + dma_vr->nr_batching =3D 0; > + dma_vr->dma_done_fn =3D dev->dmas[vring].dma_done_fn; > + > + VHOST_LOG(INFO, "Attach the DMA %s to vring %u of port %u\n", > + name, vring, eth_dev->data->port_id); > + > +out: > /* won't be NULL */ > state =3D vring_states[eth_dev->data->port_id]; > rte_spinlock_lock(&state->lock); > @@ -1239,7 +1341,7 @@ static const struct eth_dev_ops ops =3D { > static int > eth_dev_vhost_create(struct rte_vdev_device *dev, char *iface_name, > int16_t queues, const unsigned int numa_node, uint64_t flags, > - uint64_t disable_flags) > + uint64_t disable_flags, struct dma_info *dmas) > { > const char *name =3D rte_vdev_device_name(dev); > struct rte_eth_dev_data *data; > @@ -1290,6 +1392,13 @@ eth_dev_vhost_create(struct rte_vdev_device > *dev, char *iface_name, > eth_dev->rx_pkt_burst =3D eth_vhost_rx; > eth_dev->tx_pkt_burst =3D eth_vhost_tx; >=20 > + memcpy(internal->dmas, dmas, sizeof(struct dma_info) * 2 * > + RTE_MAX_QUEUES_PER_PORT); > + if (flags & RTE_VHOST_USER_DMA_COPY) > + internal->guest_mem_populated =3D true; > + else > + internal->guest_mem_populated =3D false; > + > rte_eth_dev_probing_finish(eth_dev); > return 0; >=20 > @@ -1329,6 +1438,100 @@ open_int(const char *key __rte_unused, const > char *value, void *extra_args) > return 0; > } >=20 > +struct dma_info_input { > + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; > + uint16_t nr; > +}; > + > +static inline int > +open_dma(const char *key __rte_unused, const char *value, void > *extra_args) > +{ > + struct dma_info_input *dma_info =3D extra_args; > + char *input =3D strndup(value, 
strlen(value) + 1); > + char *addrs =3D input; > + char *ptrs[2]; > + char *start, *end, *substr; > + int64_t qid, vring_id; > + struct rte_ioat_rawdev_config config; > + struct rte_rawdev_info info =3D { .dev_private =3D &config }; > + char name[32]; > + int dev_id; > + int ret =3D 0; > + > + while (isblank(*addrs)) > + addrs++; > + if (addrs =3D=3D '\0') { > + VHOST_LOG(ERR, "No input DMA addresses\n"); > + ret =3D -1; > + goto out; > + } > + > + /* process DMA devices within bracket. */ > + addrs++; > + substr =3D strtok(addrs, ";]"); > + if (!substr) { > + VHOST_LOG(ERR, "No input DMA addresse\n"); > + ret =3D -1; > + goto out; > + } > + > + do { > + rte_strsplit(substr, strlen(substr), ptrs, 2, '@'); > + Function rte_strsplit can be failed. Need to check return value. > + start =3D strstr(ptrs[0], "txq"); > + if (start =3D=3D NULL) { > + VHOST_LOG(ERR, "Illegal queue\n"); > + ret =3D -1; > + goto out; > + } > + > + start +=3D 3; It's better not use hardcode value. > + qid =3D strtol(start, &end, 0); > + if (end =3D=3D start) { > + VHOST_LOG(ERR, "No input queue ID\n"); > + ret =3D -1; > + goto out; > + } > + > + vring_id =3D qid * 2 + VIRTIO_RXQ; > + if (rte_pci_addr_parse(ptrs[1], > + &dma_info->dmas[vring_id].addr) < 0) { > + VHOST_LOG(ERR, "Invalid DMA address %s\n", > ptrs[1]); > + ret =3D -1; > + goto out; > + } > + > + rte_pci_device_name(&dma_info->dmas[vring_id].addr, > + name, sizeof(name)); > + dev_id =3D rte_rawdev_get_dev_id(name); > + if (dev_id =3D=3D (uint16_t)(-ENODEV) || > + dev_id =3D=3D (uint16_t)(-EINVAL)) { > + VHOST_LOG(ERR, "Cannot find device %s.\n", name); > + ret =3D -1; > + goto out; > + } > + Multiple queues can't share one IOAT device. Check should be here as it is = not allowed. > + if (rte_rawdev_info_get(dev_id, &info) < 0 || > + strstr(info.driver_name, "ioat") =3D=3D NULL) { > + VHOST_LOG(ERR, "The input device %s is invalid or " > + "it is not an I/OAT device\n", name); > + ret =3D -1; > + goto out; > + } > + > + dma_info->dmas[vring_id].dev_id =3D dev_id; > + dma_info->dmas[vring_id].is_valid =3D true; > + dma_info->dmas[vring_id].dma_done_fn =3D free_dma_done; > + dma_info->nr++; > + > + substr =3D strtok(NULL, ";]"); > + } while (substr); > + > +out: > + free(input); > + return ret; > +} > + > static int > rte_pmd_vhost_probe(struct rte_vdev_device *dev) > { > @@ -1345,6 +1548,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device > *dev) > int tso =3D 0; > struct rte_eth_dev *eth_dev; > const char *name =3D rte_vdev_device_name(dev); > + struct dma_info_input dma_info =3D {0}; >=20 > VHOST_LOG(INFO, "Initializing pmd_vhost for %s\n", name); >=20 > @@ -1440,11 +1644,28 @@ rte_pmd_vhost_probe(struct rte_vdev_device > *dev) > } > } >=20 > + if (rte_kvargs_count(kvlist, ETH_VHOST_DMA_ARG) =3D=3D 1) { > + ret =3D rte_kvargs_process(kvlist, ETH_VHOST_DMA_ARG, > + &open_dma, &dma_info); > + if (ret < 0) > + goto out_free; > + > + if (dma_info.nr > 0) { > + flags |=3D RTE_VHOST_USER_DMA_COPY; > + /** > + * don't support live migration when enable > + * DMA acceleration. 
> + */ > + disable_flags |=3D (1ULL << VHOST_F_LOG_ALL); > + } > + } > + > if (dev->device.numa_node =3D=3D SOCKET_ID_ANY) > dev->device.numa_node =3D rte_socket_id(); >=20 > ret =3D eth_dev_vhost_create(dev, iface_name, queues, > - dev->device.numa_node, flags, > disable_flags); > + dev->device.numa_node, flags, > + disable_flags, dma_info.dmas); > if (ret =3D=3D -1) > VHOST_LOG(ERR, "Failed to create %s\n", name); >=20 > @@ -1491,7 +1712,8 @@ RTE_PMD_REGISTER_PARAM_STRING(net_vhost, > "dequeue-zero-copy=3D<0|1> " > "iommu-support=3D<0|1> " > "postcopy-support=3D<0|1> " > - "tso=3D<0|1>"); > + "tso=3D<0|1> " > + "dmas=3D[txq0@addr0;txq1@addr1]"); >=20 > RTE_INIT(vhost_init_log) > { > diff --git a/drivers/net/vhost/virtio_net.c b/drivers/net/vhost/virtio_ne= t.c > index 11591c0..e7ba5b3 100644 > --- a/drivers/net/vhost/virtio_net.c > +++ b/drivers/net/vhost/virtio_net.c > @@ -2,11 +2,735 @@ > #include > #include >=20 > +#include > +#include > #include > +#include > +#include > +#include > +#include > +#include > #include > +#include > +#include >=20 > #include "virtio_net.h" >=20 > +#define BUF_VECTOR_MAX 256 > +#define MAX_BATCH_LEN 256 > + > +struct buf_vector { > + uint64_t buf_iova; > + uint64_t buf_addr; > + uint32_t buf_len; > + uint32_t desc_idx; > +}; > + > +static __rte_always_inline int > +vhost_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old) > +{ > + return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - > old); > +} > + > +static __rte_always_inline void > +vhost_vring_call_split(struct pmd_internal *dev, struct dma_vring > *dma_vr) > +{ > + struct rte_vhost_vring *vr =3D &dma_vr->vr; > + > + /* flush used->idx update before we read avail->flags. */ > + rte_smp_mb(); > + > + if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) { > + uint16_t old =3D dma_vr->signalled_used; > + uint16_t new =3D dma_vr->copy_done_used; > + bool signalled_used_valid =3D dma_vr->signalled_used_valid; > + > + dma_vr->signalled_used =3D new; > + dma_vr->signalled_used_valid =3D true; > + > + VHOST_LOG(DEBUG, "%s: used_event_idx=3D%d, old=3D%d, > new=3D%d\n", > + __func__, vhost_used_event(vr), old, new); > + > + if ((vhost_need_event(vhost_used_event(vr), new, old) && > + (vr->callfd >=3D 0)) || unlikely(!signalled_used_valid)) > + eventfd_write(vr->callfd, (eventfd_t)1); > + } else { > + if (!(vr->avail->flags & VRING_AVAIL_F_NO_INTERRUPT) && > + (vr->callfd >=3D 0)) > + eventfd_write(vr->callfd, (eventfd_t)1); > + } > +} > + > +/* notify front-end of enqueued packets */ > +static __rte_always_inline void > +vhost_dma_vring_call(struct pmd_internal *dev, struct dma_vring > *dma_vr) > +{ > + vhost_vring_call_split(dev, dma_vr); > +} > + > +int > +free_dma_done(void *dev, void *dma_vr) > +{ > + uintptr_t flags[255], tmps[255]; Please add meaningful macro for 255, not sure why limitation is 255 not 256= . 
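For example (the macro name below is only a suggestion), and a short comment on why the limit is 255 rather than 256 would also help:

	/* max completions reaped per rte_ioat_completed_copies() call */
	#define DMA_DONE_BURST_SZ 255

	uintptr_t flags[DMA_DONE_BURST_SZ], tmps[DMA_DONE_BURST_SZ];
	...
	dma_done = rte_ioat_completed_copies(dma_vring->dev_id,
					     DMA_DONE_BURST_SZ, flags, tmps);
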
> + int dma_done, i; > + uint16_t used_idx; > + struct pmd_internal *device =3D dev; > + struct dma_vring *dma_vring =3D dma_vr; > + > + dma_done =3D rte_ioat_completed_copies(dma_vring->dev_id, 255, > flags, > + tmps); > + if (unlikely(dma_done <=3D 0)) > + return dma_done; > + > + dma_vring->nr_inflight -=3D dma_done; Not sure whether DMA engine will return completion as input sequence, mbuf= free should after index update done.=20 > + for (i =3D 0; i < dma_done; i++) { > + if ((uint64_t)flags[i] >=3D dma_vring->max_indices) { > + struct rte_mbuf *pkt =3D (struct rte_mbuf *)flags[i]; > + > + /** > + * the DMA completes a packet copy job, we > + * decrease the refcnt or free the mbuf segment. > + */ > + rte_pktmbuf_free_seg(pkt); > + } else { > + uint16_t id =3D flags[i]; > + > + /** > + * the DMA completes updating index of the > + * used ring. > + */ > + used_idx =3D dma_vring->indices[id].data; > + VHOST_LOG(DEBUG, "The DMA finishes updating > index %u " > + "for the used ring.\n", used_idx); > + > + dma_vring->copy_done_used =3D used_idx; > + vhost_dma_vring_call(device, dma_vring); > + put_used_index(dma_vring->indices, > + dma_vring->max_indices, id); > + } > + } > + return dma_done; > +} > + > +static __rte_always_inline bool > +rxvq_is_mergeable(struct pmd_internal *dev) > +{ > + return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF); > +} > + I'm not sure whether shadow used ring can help in DMA acceleration scenario= .=20 Vhost driver will wait until DMA copy is done. Optimization in CPU move may= not help in overall performance but just add weird codes. > +static __rte_always_inline void > +do_flush_shadow_used_ring_split(struct dma_vring *dma_vr, uint16_t to, > + uint16_t from, uint16_t size) > +{ > + rte_memcpy(&dma_vr->vr.used->ring[to], > + &dma_vr->shadow_used_split[from], > + size * sizeof(struct vring_used_elem)); > +} > + > +static __rte_always_inline void > +flush_shadow_used_ring_split(struct pmd_internal *dev, > + struct dma_vring *dma_vr) > +{ > + uint16_t used_idx =3D dma_vr->last_used_idx & (dma_vr->vr.size - 1); > + > + if (used_idx + dma_vr->shadow_used_idx <=3D dma_vr->vr.size) { > + do_flush_shadow_used_ring_split(dma_vr, used_idx, 0, > + dma_vr->shadow_used_idx); > + } else { > + uint16_t size; > + > + /* update used ring interval [used_idx, vr->size] */ > + size =3D dma_vr->vr.size - used_idx; > + do_flush_shadow_used_ring_split(dma_vr, used_idx, 0, size); > + > + /* update the left half used ring interval [0, left_size] */ > + do_flush_shadow_used_ring_split(dma_vr, 0, size, > + dma_vr->shadow_used_idx - > + size); > + } > + dma_vr->last_used_idx +=3D dma_vr->shadow_used_idx; > + > + rte_smp_wmb(); > + > + if (dma_vr->nr_inflight > 0) { > + struct ring_index *index; > + > + index =3D get_empty_index(dma_vr->indices, dma_vr- > >max_indices); > + index->data =3D dma_vr->last_used_idx; > + while (unlikely(rte_ioat_enqueue_copy(dma_vr->dev_id, > + index->pa, > + dma_vr->used_idx_hpa, > + sizeof(uint16_t), > + index->idx, 0, 0) =3D=3D > + 0)) { > + int ret; > + > + do { > + ret =3D dma_vr->dma_done_fn(dev, dma_vr); > + } while (ret <=3D 0); > + } > + dma_vr->nr_batching++; > + dma_vr->nr_inflight++; > + } else { > + /** > + * we update index of used ring when all previous copy > + * jobs are completed. > + * > + * When enabling DMA copy, if there are outstanding copy > + * jobs of the DMA, to avoid the DMA overwriting the > + * write of the CPU, the DMA is in charge of updating > + * the index of used ring. 
> + */ According to comments, here should be DMA data move. But following code is = CPU data move. Anything wrong here? > + *(volatile uint16_t *)&dma_vr->vr.used->idx +=3D > + dma_vr->shadow_used_idx; > + dma_vr->copy_done_used +=3D dma_vr->shadow_used_idx; > + } > + > + dma_vr->shadow_used_idx =3D 0; > +} > + > +static __rte_always_inline void > +update_shadow_used_ring_split(struct dma_vring *dma_vr, > + uint16_t desc_idx, uint32_t len) > +{ > + uint16_t i =3D dma_vr->shadow_used_idx++; > + > + dma_vr->shadow_used_split[i].id =3D desc_idx; > + dma_vr->shadow_used_split[i].len =3D len; > +} > + > +static inline void > +do_data_copy(struct dma_vring *dma_vr) > +{ > + struct batch_copy_elem *elem =3D dma_vr->batch_copy_elems; > + uint16_t count =3D dma_vr->batch_copy_nb_elems; > + int i; > + > + for (i =3D 0; i < count; i++) > + rte_memcpy(elem[i].dst, elem[i].src, elem[i].len); > + > + dma_vr->batch_copy_nb_elems =3D 0; > +} > + > +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ > + if ((var) !=3D (val)) \ > + (var) =3D (val); \ > +} while (0) > + > +static __rte_always_inline void > +virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr > *net_hdr) > +{ > + uint64_t csum_l4 =3D m_buf->ol_flags & PKT_TX_L4_MASK; > + > + if (m_buf->ol_flags & PKT_TX_TCP_SEG) > + csum_l4 |=3D PKT_TX_TCP_CKSUM; > + > + if (csum_l4) { > + net_hdr->flags =3D VIRTIO_NET_HDR_F_NEEDS_CSUM; > + net_hdr->csum_start =3D m_buf->l2_len + m_buf->l3_len; > + > + switch (csum_l4) { > + case PKT_TX_TCP_CKSUM: > + net_hdr->csum_offset =3D (offsetof(struct rte_tcp_hdr, > + cksum)); > + break; > + case PKT_TX_UDP_CKSUM: > + net_hdr->csum_offset =3D (offsetof(struct rte_udp_hdr, > + dgram_cksum)); > + break; > + case PKT_TX_SCTP_CKSUM: > + net_hdr->csum_offset =3D (offsetof(struct rte_sctp_hdr, > + cksum)); > + break; > + } > + } else { > + ASSIGN_UNLESS_EQUAL(net_hdr->csum_start, 0); > + ASSIGN_UNLESS_EQUAL(net_hdr->csum_offset, 0); > + ASSIGN_UNLESS_EQUAL(net_hdr->flags, 0); > + } > + > + /* IP cksum verification cannot be bypassed, then calculate here */ > + if (m_buf->ol_flags & PKT_TX_IP_CKSUM) { > + struct rte_ipv4_hdr *ipv4_hdr; > + > + ipv4_hdr =3D rte_pktmbuf_mtod_offset(m_buf, struct > rte_ipv4_hdr *, > + m_buf->l2_len); > + ipv4_hdr->hdr_checksum =3D rte_ipv4_cksum(ipv4_hdr); > + } > + > + if (m_buf->ol_flags & PKT_TX_TCP_SEG) { > + if (m_buf->ol_flags & PKT_TX_IPV4) > + net_hdr->gso_type =3D VIRTIO_NET_HDR_GSO_TCPV4; > + else > + net_hdr->gso_type =3D VIRTIO_NET_HDR_GSO_TCPV6; > + net_hdr->gso_size =3D m_buf->tso_segsz; > + net_hdr->hdr_len =3D m_buf->l2_len + m_buf->l3_len > + + m_buf->l4_len; > + } else if (m_buf->ol_flags & PKT_TX_UDP_SEG) { > + net_hdr->gso_type =3D VIRTIO_NET_HDR_GSO_UDP; > + net_hdr->gso_size =3D m_buf->tso_segsz; > + net_hdr->hdr_len =3D m_buf->l2_len + m_buf->l3_len + > + m_buf->l4_len; > + } else { > + ASSIGN_UNLESS_EQUAL(net_hdr->gso_type, 0); > + ASSIGN_UNLESS_EQUAL(net_hdr->gso_size, 0); > + ASSIGN_UNLESS_EQUAL(net_hdr->hdr_len, 0); > + } > +} > + > +static __rte_always_inline void * > +vhost_alloc_copy_ind_table(struct pmd_internal *dev, uint64_t desc_addr, > + uint64_t desc_len) > +{ > + void *idesc; > + uint64_t src, dst; > + uint64_t len, remain =3D desc_len; > + > + idesc =3D rte_malloc(NULL, desc_len, 0); > + if (unlikely(!idesc)) > + return NULL; > + > + dst =3D (uint64_t)(uintptr_t)idesc; > + > + while (remain) { > + len =3D remain; > + src =3D rte_vhost_va_from_guest_pa(dev->mem, desc_addr, > &len); > + if (unlikely(!src || !len)) { > + rte_free(idesc); > + return 
NULL; > + } > + > + rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, > + len); > + > + remain -=3D len; > + dst +=3D len; > + desc_addr +=3D len; > + } > + > + return idesc; > +} > + > +static __rte_always_inline void > +free_ind_table(void *idesc) > +{ > + rte_free(idesc); > +} > + > +static __rte_always_inline int > +map_one_desc(struct pmd_internal *dev, struct buf_vector *buf_vec, > + uint16_t *vec_idx, uint64_t desc_iova, uint64_t desc_len) > +{ > + uint16_t vec_id =3D *vec_idx; > + > + while (desc_len) { > + uint64_t desc_addr; > + uint64_t desc_chunck_len =3D desc_len; > + > + if (unlikely(vec_id >=3D BUF_VECTOR_MAX)) > + return -1; > + > + desc_addr =3D rte_vhost_va_from_guest_pa(dev->mem, > desc_iova, > + &desc_chunck_len); > + if (unlikely(!desc_addr)) > + return -1; > + > + rte_prefetch0((void *)(uintptr_t)desc_addr); > + > + buf_vec[vec_id].buf_iova =3D desc_iova; > + buf_vec[vec_id].buf_addr =3D desc_addr; > + buf_vec[vec_id].buf_len =3D desc_chunck_len; > + > + desc_len -=3D desc_chunck_len; > + desc_iova +=3D desc_chunck_len; > + vec_id++; > + } > + *vec_idx =3D vec_id; > + > + return 0; > +} > + > +static __rte_always_inline int > +fill_vec_buf_split(struct pmd_internal *dev, struct dma_vring *dma_vr, > + uint32_t avail_idx, uint16_t *vec_idx, > + struct buf_vector *buf_vec, uint16_t *desc_chain_head, > + uint32_t *desc_chain_len) > +{ > + struct rte_vhost_vring *vr =3D &dma_vr->vr; > + uint16_t idx =3D vr->avail->ring[avail_idx & (vr->size - 1)]; > + uint16_t vec_id =3D *vec_idx; > + uint32_t len =3D 0; > + uint64_t dlen; > + uint32_t nr_descs =3D vr->size; > + uint32_t cnt =3D 0; > + struct vring_desc *descs =3D vr->desc; > + struct vring_desc *idesc =3D NULL; > + > + if (unlikely(idx >=3D vr->size)) > + return -1; > + > + *desc_chain_head =3D idx; > + > + if (vr->desc[idx].flags & VRING_DESC_F_INDIRECT) { > + dlen =3D vr->desc[idx].len; > + nr_descs =3D dlen / sizeof(struct vring_desc); > + if (unlikely(nr_descs > vr->size)) > + return -1; > + > + descs =3D (struct vring_desc *)(uintptr_t) > + rte_vhost_va_from_guest_pa(dev->mem, > + vr->desc[idx].addr, &dlen); > + if (unlikely(!descs)) > + return -1; > + > + if (unlikely(dlen < vr->desc[idx].len)) { > + /** > + * the indirect desc table is not contiguous > + * in process VA space, we have to copy it. 
> + */ > + idesc =3D vhost_alloc_copy_ind_table(dev, > + vr->desc[idx].addr, > + vr->desc[idx].len); > + if (unlikely(!idesc)) > + return -1; > + > + descs =3D idesc; > + } > + > + idx =3D 0; > + } > + > + while (1) { > + if (unlikely(idx >=3D nr_descs || cnt++ >=3D nr_descs)) { > + free_ind_table(idesc); > + return -1; > + } > + > + len +=3D descs[idx].len; > + > + if (unlikely(map_one_desc(dev, buf_vec, &vec_id, > + descs[idx].addr, descs[idx].len))) { > + free_ind_table(idesc); > + return -1; > + } > + > + if ((descs[idx].flags & VRING_DESC_F_NEXT) =3D=3D 0) > + break; > + > + idx =3D descs[idx].next; > + } > + > + *desc_chain_len =3D len; > + *vec_idx =3D vec_id; > + > + if (unlikely(!!idesc)) > + free_ind_table(idesc); > + > + return 0; > +} > + > +static inline int > +reserve_avail_buf_split(struct pmd_internal *dev, struct dma_vring > *dma_vr, > + uint32_t size, struct buf_vector *buf_vec, > + uint16_t *num_buffers, uint16_t avail_head, > + uint16_t *nr_vec) > +{ > + struct rte_vhost_vring *vr =3D &dma_vr->vr; > + > + uint16_t cur_idx; > + uint16_t vec_idx =3D 0; > + uint16_t max_tries, tries =3D 0; > + > + uint16_t head_idx =3D 0; > + uint32_t len =3D 0; > + > + *num_buffers =3D 0; > + cur_idx =3D dma_vr->last_avail_idx; > + > + if (rxvq_is_mergeable(dev)) > + max_tries =3D vr->size - 1; > + else > + max_tries =3D 1; > + > + while (size > 0) { > + if (unlikely(cur_idx =3D=3D avail_head)) > + return -1; > + /** > + * if we tried all available ring items, and still > + * can't get enough buf, it means something abnormal > + * happened. > + */ > + if (unlikely(++tries > max_tries)) > + return -1; > + > + if (unlikely(fill_vec_buf_split(dev, dma_vr, cur_idx, > + &vec_idx, buf_vec, > + &head_idx, &len) < 0)) > + return -1; > + len =3D RTE_MIN(len, size); > + update_shadow_used_ring_split(dma_vr, head_idx, len); > + size -=3D len; > + > + cur_idx++; > + *num_buffers +=3D 1; > + } > + > + *nr_vec =3D vec_idx; > + > + return 0; > +} > + > +static __rte_noinline void > +copy_vnet_hdr_to_desc(struct pmd_internal *dev, struct buf_vector > *buf_vec, > + struct virtio_net_hdr_mrg_rxbuf *hdr) > +{ > + uint64_t len; > + uint64_t remain =3D dev->hdr_len; > + uint64_t src =3D (uint64_t)(uintptr_t)hdr, dst; > + uint64_t iova =3D buf_vec->buf_iova; > + > + while (remain) { > + len =3D RTE_MIN(remain, buf_vec->buf_len); > + dst =3D buf_vec->buf_addr; > + rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, > + len); > + > + remain -=3D len; > + iova +=3D len; > + src +=3D len; > + buf_vec++; > + } > +} > + > +static __rte_always_inline int > +copy_mbuf_to_desc(struct pmd_internal *dev, struct dma_vring *dma_vr, > + struct rte_mbuf *m, struct buf_vector *buf_vec, > + uint16_t nr_vec, uint16_t num_buffers) > +{ > + uint32_t vec_idx =3D 0; > + uint32_t mbuf_offset, mbuf_avail; > + uint32_t buf_offset, buf_avail; > + uint64_t buf_addr, buf_iova, buf_len; > + uint32_t cpy_len; > + uint64_t hdr_addr; > + struct rte_mbuf *hdr_mbuf; > + struct batch_copy_elem *batch_copy =3D dma_vr->batch_copy_elems; > + struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr =3D NULL; > + uint64_t dst, src; > + int error =3D 0; > + > + if (unlikely(m =3D=3D NULL)) { > + error =3D -1; > + goto out; > + } > + > + buf_addr =3D buf_vec[vec_idx].buf_addr; > + buf_iova =3D buf_vec[vec_idx].buf_iova; > + buf_len =3D buf_vec[vec_idx].buf_len; > + > + if (unlikely(buf_len < dev->hdr_len && nr_vec <=3D 1)) { > + error =3D -1; > + goto out; > + } > + > + hdr_mbuf =3D m; > + hdr_addr =3D buf_addr; > + if (unlikely(buf_len < dev->hdr_len)) > 
+ hdr =3D &tmp_hdr; > + else > + hdr =3D (struct virtio_net_hdr_mrg_rxbuf > *)(uintptr_t)hdr_addr; > + > + VHOST_LOG(DEBUG, "(%d) RX: num merge buffers %d\n", dev->vid, > + num_buffers); > + > + if (unlikely(buf_len < dev->hdr_len)) { > + buf_offset =3D dev->hdr_len - buf_len; > + vec_idx++; > + buf_addr =3D buf_vec[vec_idx].buf_addr; > + buf_iova =3D buf_vec[vec_idx].buf_iova; > + buf_len =3D buf_vec[vec_idx].buf_len; > + buf_avail =3D buf_len - buf_offset; > + } else { > + buf_offset =3D dev->hdr_len; > + buf_avail =3D buf_len - dev->hdr_len; > + } > + > + mbuf_avail =3D rte_pktmbuf_data_len(m); > + mbuf_offset =3D 0; > + while (mbuf_avail !=3D 0 || m->next !=3D NULL) { > + bool dma_copy =3D false; > + > + /* done with current buf, get the next one */ > + if (buf_avail =3D=3D 0) { > + vec_idx++; > + if (unlikely(vec_idx >=3D nr_vec)) { > + error =3D -1; > + goto out; > + } > + > + buf_addr =3D buf_vec[vec_idx].buf_addr; > + buf_iova =3D buf_vec[vec_idx].buf_iova; > + buf_len =3D buf_vec[vec_idx].buf_len; > + > + buf_offset =3D 0; > + buf_avail =3D buf_len; > + } > + > + /* done with current mbuf, get the next one */ > + if (mbuf_avail =3D=3D 0) { > + m =3D m->next; > + mbuf_offset =3D 0; > + mbuf_avail =3D rte_pktmbuf_data_len(m); > + } > + > + if (hdr_addr) { > + virtio_enqueue_offload(hdr_mbuf, &hdr->hdr); > + if (rxvq_is_mergeable(dev)) > + ASSIGN_UNLESS_EQUAL(hdr->num_buffers, > + num_buffers); > + > + if (unlikely(hdr =3D=3D &tmp_hdr)) > + copy_vnet_hdr_to_desc(dev, buf_vec, hdr); > + hdr_addr =3D 0; > + } > + > + cpy_len =3D RTE_MIN(buf_avail, mbuf_avail); > + if (cpy_len >=3D DMA_COPY_LENGTH_THRESHOLD) { > + dst =3D gpa_to_hpa(dev, buf_iova + buf_offset, > cpy_len); > + dma_copy =3D (dst !=3D 0); > + } > + > + if (dma_copy) { > + src =3D rte_pktmbuf_iova_offset(m, mbuf_offset); > + /** > + * if DMA enqueue fails, we wait until there are > + * available DMA descriptors. > + */ > + while (unlikely(rte_ioat_enqueue_copy(dma_vr- > >dev_id, > + src, dst, cpy_len, > + (uintptr_t) > + m, 0, 0) =3D=3D > + 0)) { > + int ret; > + > + do { > + ret =3D free_dma_done(dev, dma_vr); > + } while (ret <=3D 0); > + } > + > + dma_vr->nr_batching++; > + dma_vr->nr_inflight++; > + rte_mbuf_refcnt_update(m, 1); > + } else if (likely(cpy_len > MAX_BATCH_LEN || > + dma_vr->batch_copy_nb_elems >=3D > + dma_vr->vr.size)) { > + rte_memcpy((void *)((uintptr_t)(buf_addr + > buf_offset)), > + rte_pktmbuf_mtod_offset(m, void *, > + mbuf_offset), > + cpy_len); > + } else { > + batch_copy[dma_vr->batch_copy_nb_elems].dst =3D > + (void *)((uintptr_t)(buf_addr + buf_offset)); > + batch_copy[dma_vr->batch_copy_nb_elems].src =3D > + rte_pktmbuf_mtod_offset(m, void *, > mbuf_offset); > + batch_copy[dma_vr->batch_copy_nb_elems].len =3D > cpy_len; > + dma_vr->batch_copy_nb_elems++; > + } > + > + mbuf_avail -=3D cpy_len; > + mbuf_offset +=3D cpy_len; > + buf_avail -=3D cpy_len; > + buf_offset +=3D cpy_len; > + } > + > +out: > + return error; > +} > + > +static __rte_always_inline uint16_t > +vhost_dma_enqueue_split(struct pmd_internal *dev, struct dma_vring > *dma_vr, > + struct rte_mbuf **pkts, uint32_t count) > +{ > + struct rte_vhost_vring *vr =3D &dma_vr->vr; > + > + uint32_t pkt_idx =3D 0; > + uint16_t num_buffers; > + struct buf_vector buf_vec[BUF_VECTOR_MAX]; > + uint16_t avail_head; > + > + if (dma_vr->nr_inflight > 0) > + free_dma_done(dev, dma_vr); > + > + avail_head =3D *((volatile uint16_t *)&vr->avail->idx); > + > + /** > + * the ordering between avail index and > + * desc reads needs to be enforced. 
> + */ > + rte_smp_rmb(); > + > + rte_prefetch0(&vr->avail->ring[dma_vr->last_avail_idx & > + (vr->size - 1)]); > + > + for (pkt_idx =3D 0; pkt_idx < count; pkt_idx++) { > + uint32_t pkt_len =3D pkts[pkt_idx]->pkt_len + dev->hdr_len; > + uint16_t nr_vec =3D 0; > + > + if (unlikely(reserve_avail_buf_split(dev, dma_vr, pkt_len, > + buf_vec, &num_buffers, > + avail_head, &nr_vec) < > + 0)) { > + VHOST_LOG(INFO, > + "(%d) failed to get enough desc from > vring\n", > + dev->vid); > + dma_vr->shadow_used_idx -=3D num_buffers; > + break; > + } > + > + VHOST_LOG(DEBUG, "(%d) current index %d | end > index %d\n", > + dev->vid, dma_vr->last_avail_idx, > + dma_vr->last_avail_idx + num_buffers); > + > + if (copy_mbuf_to_desc(dev, dma_vr, pkts[pkt_idx], > + buf_vec, nr_vec, num_buffers) < 0) { > + dma_vr->shadow_used_idx -=3D num_buffers; > + break; > + } > + > + if (unlikely(dma_vr->nr_batching >=3D DMA_BATCHING_SIZE)) { > + /** > + * kick the DMA to do copy once the number of > + * batching jobs reaches the batching threshold. > + */ > + rte_ioat_do_copies(dma_vr->dev_id); > + dma_vr->nr_batching =3D 0; > + } > + > + dma_vr->last_avail_idx +=3D num_buffers; > + } > + > + do_data_copy(dma_vr); > + > + if (dma_vr->shadow_used_idx) { > + flush_shadow_used_ring_split(dev, dma_vr); > + vhost_dma_vring_call(dev, dma_vr); > + } > + > + if (dma_vr->nr_batching > 0) { > + rte_ioat_do_copies(dma_vr->dev_id); > + dma_vr->nr_batching =3D 0; > + } > + > + return pkt_idx; > +} > + > +uint16_t > +vhost_dma_enqueue_burst(struct pmd_internal *dev, struct dma_vring > *dma_vr, > + struct rte_mbuf **pkts, uint32_t count) > +{ > + return vhost_dma_enqueue_split(dev, dma_vr, pkts, count); > +} > + > int > vhost_dma_setup(struct pmd_internal *dev) > { > @@ -69,6 +793,9 @@ vhost_dma_setup(struct pmd_internal *dev) > dma_vr->used_idx_hpa =3D > rte_mem_virt2iova(&dma_vr->vr.used->idx); >=20 > + dma_vr->max_indices =3D dma_vr->vr.size; > + setup_ring_index(&dma_vr->indices, dma_vr->max_indices); > + > dma_vr->copy_done_used =3D dma_vr->last_used_idx; > dma_vr->signalled_used =3D dma_vr->last_used_idx; > dma_vr->signalled_used_valid =3D false; > @@ -83,6 +810,7 @@ vhost_dma_setup(struct pmd_internal *dev) > dma_vr =3D &dev->dma_vrings[j]; > rte_free(dma_vr->shadow_used_split); > rte_free(dma_vr->batch_copy_elems); > + destroy_ring_index(&dma_vr->indices); > dma_vr->shadow_used_split =3D NULL; > dma_vr->batch_copy_elems =3D NULL; > dma_vr->used_idx_hpa =3D 0; > @@ -104,12 +832,26 @@ vhost_dma_remove(struct pmd_internal *dev) >=20 > for (i =3D 0; i < dev->nr_vrings; i++) { > dma_vr =3D &dev->dma_vrings[i]; > + if (dma_vr->dma_enabled) { > + while (dma_vr->nr_inflight > 0) > + dma_vr->dma_done_fn(dev, dma_vr); > + > + VHOST_LOG(INFO, "Wait for outstanding DMA jobs " > + "of vring %u completion\n", i); > + rte_rawdev_stop(dma_vr->dev_id); > + dma_vr->dma_enabled =3D false; > + dma_vr->nr_batching =3D 0; > + dma_vr->dev_id =3D -1; > + } > + > rte_free(dma_vr->shadow_used_split); > rte_free(dma_vr->batch_copy_elems); > dma_vr->shadow_used_split =3D NULL; > dma_vr->batch_copy_elems =3D NULL; > dma_vr->signalled_used_valid =3D false; > dma_vr->used_idx_hpa =3D 0; > + destroy_ring_index(&dma_vr->indices); > + dma_vr->max_indices =3D 0; > } >=20 > free(dev->mem); > diff --git a/drivers/net/vhost/virtio_net.h b/drivers/net/vhost/virtio_ne= t.h > index 7f99f1d..44a7cdd 100644 > --- a/drivers/net/vhost/virtio_net.h > +++ b/drivers/net/vhost/virtio_net.h > @@ -14,6 +14,89 @@ extern "C" { >=20 > #include "internal.h" >=20 > +#ifndef 
VIRTIO_F_RING_PACKED > +#define VIRTIO_F_RING_PACKED 34 > +#endif > + > +/* batching size before invoking the DMA to perform transfers */ > +#define DMA_BATCHING_SIZE 8 > +/** > + * copy length threshold for the DMA engine. We offload copy jobs whose > + * lengths are greater than DMA_COPY_LENGTH_THRESHOLD to the DMA; > for > + * small copies, we still use the CPU to perform copies, due to startup > + * overheads associated with the DMA. > + * > + * As DMA copying is asynchronous with CPU computations, we can > + * dynamically increase or decrease the value if the DMA is busier or > + * idler than the CPU. > + */ > +#define DMA_COPY_LENGTH_THRESHOLD 1024 > + > +#define vhost_used_event(vr) \ > + (*(volatile uint16_t*)&(vr)->avail->ring[(vr)->size]) > + > +struct ring_index { > + /* physical address of 'data' */ > + uintptr_t pa; > + uintptr_t idx; > + uint16_t data; > + bool in_use; > +} __rte_cache_aligned; > + > +static __rte_always_inline int > +setup_ring_index(struct ring_index **indices, uint16_t num) > +{ > + struct ring_index *array; > + uint16_t i; > + > + array =3D rte_zmalloc(NULL, sizeof(struct ring_index) * num, 0); > + if (!array) { > + *indices =3D NULL; > + return -1; > + } > + > + for (i =3D 0; i < num; i++) { > + array[i].pa =3D rte_mem_virt2iova(&array[i].data); > + array[i].idx =3D i; > + } > + > + *indices =3D array; > + return 0; > +} > + > +static __rte_always_inline void > +destroy_ring_index(struct ring_index **indices) > +{ > + if (!indices) > + return; > + rte_free(*indices); > + *indices =3D NULL; > +} > + > +static __rte_always_inline struct ring_index * > +get_empty_index(struct ring_index *indices, uint16_t num) > +{ > + uint16_t i; > + > + for (i =3D 0; i < num; i++) > + if (!indices[i].in_use) > + break; > + > + if (unlikely(i =3D=3D num)) > + return NULL; > + > + indices[i].in_use =3D true; > + return &indices[i]; > +} > + > +static __rte_always_inline void > +put_used_index(struct ring_index *indices, uint16_t num, uint16_t idx) > +{ > + if (unlikely(idx >=3D num)) > + return; > + indices[idx].in_use =3D false; > +} > + > static uint64_t > get_blk_size(int fd) > { > @@ -149,6 +232,15 @@ gpa_to_hpa(struct pmd_internal *dev, uint64_t > gpa, uint64_t size) > } >=20 > /** > + * This function checks if packed rings are enabled. > + */ > +static __rte_always_inline bool > +vhost_dma_vring_is_packed(struct pmd_internal *dev) > +{ > + return dev->features & (1ULL << VIRTIO_F_RING_PACKED); > +} > + > +/** > * This function gets front end's memory and vrings information. > * In addition, it sets up necessary data structures for enqueue > * and dequeue operations. > @@ -161,6 +253,34 @@ int vhost_dma_setup(struct pmd_internal *dev); > */ > void vhost_dma_remove(struct pmd_internal *dev); >=20 > +/** > + * This function frees DMA copy-done pktmbufs for the enqueue operation. > + * > + * @return > + * the number of packets that are completed by the DMA engine > + */ > +int free_dma_done(void *dev, void *dma_vr); > + > +/** > + * This function sends packet buffers to front end's RX vring. > + * It will free the mbufs of successfully transmitted packets. 
> + * > + * @param dev > + * vhost-dma device > + * @param dma_vr > + * a front end's RX vring > + * @param pkts > + * packets to send > + * @param count > + * the number of packets to send > + * > + * @return > + * the number of packets successfully sent > + */ > +uint16_t vhost_dma_enqueue_burst(struct pmd_internal *dev, > + struct dma_vring *dma_vr, > + struct rte_mbuf **pkts, uint32_t count); > + > #ifdef __cplusplus > } > #endif > -- > 2.7.4
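
One more note on open_dma(): the two missing checks I mentioned above (the rte_strsplit() return value and one I/OAT device per queue) could look roughly like below. This is only a sketch that follows the naming in this patch and assumes a local loop index i:

	if (rte_strsplit(substr, strlen(substr), ptrs, 2, '@') != 2) {
		VHOST_LOG(ERR, "Invalid DMA parameter format %s\n", substr);
		ret = -1;
		goto out;
	}

	...

	/* an I/OAT device must not be shared by multiple queues */
	for (i = 0; i < RTE_MAX_QUEUES_PER_PORT * 2; i++) {
		if (dma_info->dmas[i].is_valid &&
		    dma_info->dmas[i].dev_id == dev_id) {
			VHOST_LOG(ERR, "I/OAT %s is already assigned to "
				  "another queue\n", name);
			ret = -1;
			goto out;
		}
	}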