From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 58768A0C40; Sat, 31 Jul 2021 09:07:24 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id CD1EC4014D; Sat, 31 Jul 2021 09:07:23 +0200 (CEST) Received: from mail-il1-f179.google.com (mail-il1-f179.google.com [209.85.166.179]) by mails.dpdk.org (Postfix) with ESMTP id 87A7140042; Sat, 31 Jul 2021 09:07:22 +0200 (CEST) Received: by mail-il1-f179.google.com with SMTP id j18so8643611ile.8; Sat, 31 Jul 2021 00:07:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=mEv2KKOpUUoaukdMotNVcGjXE7JARDmXgpBF3ggNB2o=; b=gQnqd5orL0fWQH6E1tl2GLtk/wPNb+rxOugWVRK4V56yRbN4raugz9TvTSl64MYcst 1lo3G/inO9lxiZQCl/sP6Q/ZSe7VDWLVmHPVqzSCyAGQ/dbnfmWt2DG/NKxv0ebJD0p+ 79yH9QAPaeT0/vi5xqUiGRfF2mZEZT61s8Dl6LCsXRTw3YZJV4XfzMLcvn0KMCAx6vS9 vsX+12Pkk9CqFvqQ7RPLHhcRC25eF6xISInyOo5TZw+OFCS3jc48pVT1KvTx7CJdBsbq k+6Eo42lf2okcdfgB5WrlrS3KB69XXKBJUGMbLpwz7tyaQLbE1JpSk8mT7QuXcKcbiNi Uvxg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=mEv2KKOpUUoaukdMotNVcGjXE7JARDmXgpBF3ggNB2o=; b=h85VtP5xADu6Zl8sGaPUMneXNf5esqdCNA2c6p4wwOwWpX4PYOfB7n+3WlstHOZO7O +/e9Wc00Xh+AMVYn6/ZxNc4hbaCC8CWfPTYYmmd1Tp/YxE6DwexDIU7wjjGm1KDMN0FS KSkcoyTqJYLCQJCW6auqpv/gEY/649zGIyJxFEG2hie4p+8DT7MySjrz1MBG9C5RqpKV OVYMs4PYsy8JWsg50FmYfhCeAau5aA9+zxFhPyOHz9JW1WdS9C6QlrFLpXxzKeyU1qSE /NzTFrAEtVZ0cRJ16+WlQl8CROQgAA5bwcZplH529m1Clanql02TWcLtewe4NByt/J2R GlSg== X-Gm-Message-State: AOAM533tA/9maM3bVFQ56rnoIgxJdUizjIjAYN0M6rrG8Ah7+NWjmn+P RLlG79H7y5hXvlkzjJQIRq/Bx3XvG7kQ7ldLhjc= X-Google-Smtp-Source: ABdhPJwZljYjiq9we3QKLfpFSIXbJfKl2oTLBDh1i/gfToabh0KoqLkY4uUxn6WEoUbkXDcMK5m8xJA8Lnm8eJIARpw= X-Received: by 2002:a92:c5c5:: with SMTP id s5mr349807ilt.271.1627715241655; Sat, 31 Jul 2021 00:07:21 -0700 (PDT) MIME-Version: 1.0 References: <20210602203531.2288645-1-thomas@monjalon.net> <20210730135533.417611-1-thomas@monjalon.net> In-Reply-To: <20210730135533.417611-1-thomas@monjalon.net> From: Jerin Jacob Date: Sat, 31 Jul 2021 12:36:55 +0530 Message-ID: To: Thomas Monjalon Cc: dpdk-dev , Stephen Hemminger , David Marchand , Andrew Rybchenko , Haiyue Wang , Honnappa Nagarahalli , Jerin Jacob , Ferruh Yigit , techboard@dpdk.org Content-Type: text/plain; charset="UTF-8" Subject: Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon wrote: > > From: Elena Agostini > > In heterogeneous computing system, processing is not only in the CPU. > Some tasks can be delegated to devices working in parallel. > > The goal of this new library is to enhance the collaboration between > DPDK, that's primarily a CPU framework, and other type of devices like GPUs. > > When mixing network activity with task processing on a non-CPU device, > there may be the need to put in communication the CPU with the device > in order to manage the memory, synchronize operations, exchange info, etc.. > > This library provides a number of new features: > - Interoperability with device specific library with generic handlers > - Possibility to allocate and free memory on the device > - Possibility to allocate and free memory on the CPU but visible from the device > - Communication functions to enhance the dialog between the CPU and the device > > The infrastructure is prepared to welcome drivers in drivers/hc/ > as the upcoming NVIDIA one, implementing the hcdev API. > > Some parts are not complete: > - locks > - memory allocation table > - memory freeing > - guide documentation > - integration in devtools/check-doc-vs-code.sh > - unit tests > - integration in testpmd to enable Rx/Tx to/from GPU memory. Since the above line is the crux of the following text, I will start from this point. + Techboard I can give my honest feedback on this. I can map similar stuff in Marvell HW, where we do machine learning as compute offload on a different class of CPU. In terms of RFC patch features 1) memory API - Use cases are aligned 2) communication flag and communication list Our structure is completely different and we are using HW ring kind of interface to post the job to compute interface and the job completion result happens through the event device. Kind of similar to the DMA API that has been discussed on the mailing list. Now the bigger question is why need to Tx and then Rx something to compute the device Isn't ot offload something? If so, why not add the those offload in respective subsystem to improve the subsystem(ethdev, cryptiodev etc) features set to adapt new features or introduce new subsystem (like ML, Inline Baseband processing) so that it will be an opportunity to implement the same in HW or compute device. For example, if we take this path, ML offloading will be application code like testpmd, which deals with "specific" device commands(aka glorified rawdev) to deal with specific computing device offload "COMMANDS" (The commands will be specific to offload device, the same code wont run on other compute device) Just my _personal_ preference is to have specific subsystems to improve the DPDK instead of raw device kind of path. If we decide another path as a community it is _fine_ too(as a _project manager_ point of view it will be an easy path to dump SDK stuff to DPDK without introducing the pain of the subsystem nor improving the DPDK). > > Below is a pseudo-code to give an example about how to use functions > in this library in case of a CUDA application. > > > Elena Agostini (4): > hcdev: introduce heterogeneous computing device library > hcdev: add memory API > hcdev: add communication flag > hcdev: add communication list > > Thomas Monjalon (3): > hcdev: add event notification > hcdev: add child device representing a device context > hcdev: support multi-process > > .gitignore | 1 + > MAINTAINERS | 6 + > doc/api/doxy-api-index.md | 1 + > doc/api/doxy-api.conf.in | 1 + > doc/guides/conf.py | 8 + > doc/guides/hcdevs/features/default.ini | 13 + > doc/guides/hcdevs/index.rst | 11 + > doc/guides/hcdevs/overview.rst | 11 + > doc/guides/index.rst | 1 + > doc/guides/prog_guide/hcdev.rst | 5 + > doc/guides/prog_guide/index.rst | 1 + > doc/guides/rel_notes/release_21_08.rst | 5 + > drivers/hc/meson.build | 4 + > drivers/meson.build | 1 + > lib/hcdev/hcdev.c | 789 +++++++++++++++++++++++++ > lib/hcdev/hcdev_driver.h | 96 +++ > lib/hcdev/meson.build | 12 + > lib/hcdev/rte_hcdev.h | 592 +++++++++++++++++++ > lib/hcdev/version.map | 35 ++ > lib/meson.build | 1 + > 20 files changed, 1594 insertions(+) > create mode 100644 doc/guides/hcdevs/features/default.ini > create mode 100644 doc/guides/hcdevs/index.rst > create mode 100644 doc/guides/hcdevs/overview.rst > create mode 100644 doc/guides/prog_guide/hcdev.rst > create mode 100644 drivers/hc/meson.build > create mode 100644 lib/hcdev/hcdev.c > create mode 100644 lib/hcdev/hcdev_driver.h > create mode 100644 lib/hcdev/meson.build > create mode 100644 lib/hcdev/rte_hcdev.h > create mode 100644 lib/hcdev/version.map > > > > //////////////////////////////////////////////////////////////////////// > ///// HCDEV library + CUDA functions > //////////////////////////////////////////////////////////////////////// > #define GPU_PAGE_SHIFT 16 > #define GPU_PAGE_SIZE (1UL << GPU_PAGE_SHIFT) > > int main() { > struct rte_hcdev_flag quit_flag; > struct rte_hcdev_comm_list *comm_list; > int nb_rx = 0; > int comm_list_entry = 0; > struct rte_mbuf * rx_mbufs[max_rx_mbufs]; > cudaStream_t cstream; > struct rte_mempool *mpool_payload, *mpool_header; > struct rte_pktmbuf_extmem ext_mem; > int16_t dev_id; > > /* Initialize CUDA objects (cstream, context, etc..). */ > /* Use hcdev library to register a new CUDA context if any */ > /* Let's assume the application wants to use the default context of the GPU device 0 */ > dev_id = 0; > > /* Create an external memory mempool using memory allocated on the GPU. */ > ext_mem.elt_size = mbufs_headroom_size; > ext_mem.buf_len = RTE_ALIGN_CEIL(mbufs_num * ext_mem.elt_size, GPU_PAGE_SIZE); > ext_mem.buf_iova = RTE_BAD_IOVA; > ext_mem.buf_ptr = rte_hcdev_malloc(dev_id, ext_mem.buf_len, 0); > rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len, NULL, ext_mem.buf_iova, GPU_PAGE_SIZE); > rte_dev_dma_map(rte_eth_devices[l2fwd_port_id].device, ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len); > mpool_payload = rte_pktmbuf_pool_create_extbuf("gpu_mempool", mbufs_num, > 0, 0, ext_mem.elt_size, > rte_socket_id(), &ext_mem, 1); > > /* > * Create CPU - device communication flag. With this flag, the CPU can tell to the CUDA kernel > * to exit from the main loop. > */ > rte_hcdev_comm_create_flag(dev_id, &quit_flag, RTE_HCDEV_COMM_FLAG_CPU); > rte_hcdev_comm_set_flag(&quit_flag, 0); > > /* > * Create CPU - device communication list. Each entry of this list will be populated by the CPU > * with a new set of received mbufs that the CUDA kernel has to process. > */ > comm_list = rte_hcdev_comm_create_list(dev_id, num_entries); > > /* A very simple CUDA kernel with just 1 CUDA block and RTE_HCDEV_COMM_LIST_PKTS_MAX CUDA threads. */ > cuda_kernel_packet_processing<<<1, RTE_HCDEV_COMM_LIST_PKTS_MAX, 0, cstream>>>(quit_flag->ptr, comm_list, num_entries, ...); > > /* > * For simplicity, the CPU here receives only 2 bursts of mbufs. > * In a real application, network activity and device processing should overlap. > */ > nb_rx = rte_eth_rx_burst(port_id, queue_id, &(rx_mbufs[0]), max_rx_mbufs); > rte_hcdev_comm_populate_list_pkts(comm_list[0], rx_mbufs, nb_rx); > nb_rx = rte_eth_rx_burst(port_id, queue_id, &(rx_mbufs[0]), max_rx_mbufs); > rte_hcdev_comm_populate_list_pkts(comm_list[1], rx_mbufs, nb_rx); > > /* > * CPU waits for the completion of the packets' processing on the CUDA kernel > * and then it does a cleanup of the received mbufs. > */ > while (rte_hcdev_comm_cleanup_list(comm_list[0])); > while (rte_hcdev_comm_cleanup_list(comm_list[1])); > > /* CPU notifies the CUDA kernel that it has to terminate */ > rte_hcdev_comm_set_flag(&quit_flag, 1); > > /* hcdev objects cleanup/destruction */ > /* CUDA cleanup */ > /* DPDK cleanup */ > > return 0; > } > > //////////////////////////////////////////////////////////////////////// > ///// CUDA kernel > //////////////////////////////////////////////////////////////////////// > > void cuda_kernel(uint32_t * quit_flag_ptr, struct rte_hcdev_comm_list *comm_list, int comm_list_entries) { > int comm_list_index = 0; > struct rte_hcdev_comm_pkt *pkt_list = NULL; > > /* Do some pre-processing operations. */ > > /* GPU kernel keeps checking this flag to know if it has to quit or wait for more packets. */ > while (*quit_flag_ptr == 0) > { > if (comm_list[comm_list_index]->status != RTE_HCDEV_COMM_LIST_READY) > continue; > > if (threadIdx.x < comm_list[comm_list_index]->num_pkts) > { > /* Each CUDA thread processes a different packet. */ > packet_processing(comm_list[comm_list_index]->addr, comm_list[comm_list_index]->size, ..); > } > __threadfence(); > __syncthreads(); > > /* Wait for new packets on the next communication list entry. */ > comm_list_index = (comm_list_index+1) % comm_list_entries; > } > > /* Do some post-processing operations. */ > } > > > -- > 2.31.1 >