To: Viacheslav Ovsiienko, dev@dpdk.org
Cc: thomasm@monjalon.net, stephen@networkplumber.org, olivier.matz@6wind.com,
 jerinjacobk@gmail.com, maxime.coquelin@redhat.com, david.marchand@redhat.com,
 arybchenko@solarflare.com
References: <1602616917-22193-1-git-send-email-viacheslavo@nvidia.com>
 <1602616917-22193-2-git-send-email-viacheslavo@nvidia.com>
From: Ferruh Yigit
Date: Tue, 13 Oct 2020 23:34:16 +0100
In-Reply-To: <1602616917-22193-2-git-send-email-viacheslavo@nvidia.com>
Subject: Re: [dpdk-dev] [PATCH v5 1/6] ethdev: introduce Rx buffer split

On 10/13/2020 8:21 PM, Viacheslav Ovsiienko wrote:
> The DPDK datapath in the transmit direction is very flexible.
> An application can build multi-segment packets and manage
> almost all data aspects - the memory pools the segments are
> allocated from, the segment lengths, the memory attributes
> like external buffers registered for DMA, etc.
>
> In the receiving direction the datapath is much less flexible:
> an application can only specify the memory pool to configure
> the receiving queue, and nothing more. In order to extend the
> receiving datapath capabilities it is proposed to add a way to
> provide extended information on how to split the packets being
> received.
>
> The following structure is introduced to specify the Rx packet
> segment:
>
> struct rte_eth_rxseg {
>     struct rte_mempool *mp; /* memory pool to allocate segment from */
>     uint16_t length;        /* segment maximal data length,
>                                configures "split point" */
>     uint16_t offset;        /* data offset from beginning
>                                of mbuf data buffer */
>     uint32_t reserved;      /* reserved field */
> };
>
> The segment descriptions are added to the rte_eth_rxconf structure:
>    rx_seg - pointer to the array of segment descriptions, each element
>             describing the memory pool, maximal data length, and initial
>             data offset from the beginning of the data buffer in the mbuf.
>             This array allows specifying the settings for each segment
>             individually.
>    n_seg  - number of elements in the array
>
> If the extended segment descriptions are provided via these new
> fields, the mp parameter of rte_eth_rx_queue_setup must be
> specified as NULL to avoid ambiguity.
>
> The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
> capabilities is introduced so the PMD can report to the
> application that it supports splitting received packets into
> configurable segments. Prior to invoking the rte_eth_rx_queue_setup()
> routine the application should check the RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> flag.
>
> If the Rx queue is configured with the new routine, the packets being
> received will be split into multiple segments pushed to mbufs with
> the specified attributes. The PMD will split the received packets
> into multiple segments according to the specification in the
> description array:
>
> - the first network buffer will be allocated from the memory pool
>   specified in the first segment description element, the second
>   network buffer from the pool in the second segment description
>   element, and so on. If there are not enough elements to describe
>   the buffer for an entire packet of maximal length, the pool from
>   the last valid element will be used to allocate the buffers for
>   the rest of the segments.
>
> - the offsets from the segment description elements provide the
>   data offset from the buffer beginning, except for the first mbuf -
>   for this one the offset is added to RTE_PKTMBUF_HEADROOM to get
>   the actual offset from the buffer beginning. If there are not
>   enough elements to describe the buffer for an entire packet of
>   maximal length, the offsets for the rest of the segments are
>   assumed to be zero.
>
> - the data length received into each segment is limited by the
>   length specified in the segment description element. Receiving data
>   starts by filling up the first mbuf data buffer; if the specified
>   maximal segment length is reached and there is data remaining
>   (the packet is longer than the buffer in the first mbuf), the
>   following data is pushed to the next segment up to its own
>   maximal length. If the first two segments are not enough to store
>   all the remaining packet data, the next (third) segment is
>   engaged, and so on. If the length in a segment description
>   element is zero, the actual buffer size is deduced from the
>   corresponding memory pool properties. If there are not enough
>   elements to describe the buffer for an entire packet of maximal
>   length, the buffer size for the remaining segments is deduced
>   from the pool of the last valid element.
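Just to make sure I read the intended usage correctly from the application
side, something like below is what I have in mind (untested sketch; the pool
names 'hdr_pool'/'data_pool', the 128 byte header length and the other
parameters are only placeholders, field names as in this patch):

    #include <errno.h>
    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    /* Untested sketch: hdr_pool/data_pool are placeholder mempools created
     * elsewhere with rte_pktmbuf_pool_create(). */
    static int
    setup_split_rxq(uint16_t port_id, uint16_t queue_id, uint16_t nb_rxd,
                    struct rte_mempool *hdr_pool, struct rte_mempool *data_pool)
    {
        struct rte_eth_dev_info dev_info;
        struct rte_eth_rxconf rxconf;
        struct rte_eth_rxseg segs[2] = {
            /* packet headers go to small buffers from hdr_pool */
            { .mp = hdr_pool, .length = 128, .offset = 0 },
            /* rest of the packet goes to data_pool; length == 0 means
             * "deduce the size from the pool buffer size" */
            { .mp = data_pool, .length = 0, .offset = 0 },
        };
        int ret;

        ret = rte_eth_dev_info_get(port_id, &dev_info);
        if (ret != 0)
            return ret;
        if (!(dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT))
            return -ENOTSUP;

        rxconf = dev_info.default_rxconf;
        rxconf.rx_seg = segs;
        rxconf.rx_nseg = 2;
        rxconf.offloads |= RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT;

        /* 'mb_pool' must be NULL when the segment descriptions are used */
        return rte_eth_rx_queue_setup(port_id, queue_id, nb_rxd,
                                      rte_eth_dev_socket_id(port_id),
                                      &rxconf, NULL);
    }

Is that right?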
>
> For example, let's suppose we configured the Rx queue with the
> following segments:
>     seg0 - pool0, len0=14B,  off0=2
>     seg1 - pool1, len1=20B,  off1=128B
>     seg2 - pool2, len2=20B,  off2=0B
>     seg3 - pool3, len3=512B, off3=0B
>
> A packet 46 bytes long will look like the following:
>     seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>     seg1 - 20B long @ 128 in mbuf from pool1
>     seg2 - 12B long @ 0 in mbuf from pool2
>
> A packet 1500 bytes long will look like the following:
>     seg0 - 14B  @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>     seg1 - 20B  @ 128 in mbuf from pool1
>     seg2 - 20B  @ 0 in mbuf from pool2
>     seg3 - 512B @ 0 in mbuf from pool3
>     seg4 - 512B @ 0 in mbuf from pool3
>     seg5 - 422B @ 0 in mbuf from pool3
>
> The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
> configured to support the new buffer split feature (if n_seg
> is greater than one).
>
> The new approach allows splitting the ingress packets into
> multiple parts pushed to memory with different attributes.
> For example, the packet headers can be pushed to the embedded
> data buffers within mbufs and the application data into
> external buffers attached to mbufs allocated from different
> memory pools. The memory attributes for the split parts may
> differ as well - for example the application data may be pushed
> into external memory located on a dedicated physical device,
> say a GPU or NVMe. This improves the flexibility of the DPDK
> receiving datapath while preserving compatibility with the
> existing API.
>
> Signed-off-by: Viacheslav Ovsiienko
> ---
>  doc/guides/nics/features.rst             | 15 +++++
>  doc/guides/rel_notes/release_20_11.rst   |  9 +++
>  lib/librte_ethdev/rte_ethdev.c           | 95 ++++++++++++++++++++++++--------
>  lib/librte_ethdev/rte_ethdev.h           | 58 ++++++++++++++++++-
>  lib/librte_ethdev/rte_ethdev_version.map |  1 +

Can you please update the deprecation notice too, to remove the notice?

>  5 files changed, 155 insertions(+), 23 deletions(-)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index dd8c955..a45a9e8 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -185,6 +185,21 @@ Supports receiving segmented mbufs.
>  * **[related] eth_dev_ops**: ``rx_pkt_burst``.
>
>
> +.. _nic_features_buffer_split:
> +
> +Buffer Split on Rx
> +------------------
> +
> +Scatters the packets being received on specified boundaries to segmented mbufs.
> +
> +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.

This also uses 'rx_seg' and 'rx_nseg' from 'rte_eth_rxconf'; perhaps that can
be another line.

> +* **[implements] datapath**: ``Buffer Split functionality``.
> +* **[implements] rte_eth_dev_data**: ``buffer_split``.

What is implemented here?

> +* **[provides] rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> +* **[provides] eth_dev_ops**: ``rxq_info_get:buffer_split``.

Is this correct?

> +* **[related] API**: ``rte_eth_rx_queue_setup()``.
> +
> +
>  .. _nic_features_lro:
>
>  LRO
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index bcc0fc2..f67ec62 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -60,6 +60,12 @@ New Features
>    Added the FEC API which provides functions for query FEC capabilities and
>    current FEC mode from device. Also, API for configuring FEC mode is also provided.
>
> +* **Introduced extended buffer description for receiving.**
> +
> +  Added the extended Rx queue setup routine providing the individual
> +  descriptions for each Rx segment with maximal size, buffer offset and memory
> +  pool to allocate data buffers from.

This looks wrong with the latest version.

> +
>  * **Updated Broadcom bnxt driver.**
>
>    Updated the Broadcom bnxt driver with new features and improvements, including:
> @@ -253,6 +259,9 @@ API Changes
>    As the data of ``uint8_t`` will be truncated when queue number under
>    a TC is greater than 256.
>
> +* ethdev: Added fields rx_seg and rx_nseg to rte_eth_rxconf structure
> +  to provide extended description of the receiving buffer.
> +
>  * vhost: Moved vDPA APIs from experimental to stable.
>
>  * rawdev: Added a structure size parameter to the functions
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 892c246..7b64a6e 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -105,6 +105,9 @@ struct rte_eth_xstats_name_off {
>  #define RTE_RX_OFFLOAD_BIT2STR(_name) \
>      { DEV_RX_OFFLOAD_##_name, #_name }
>
> +#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name) \
> +    { RTE_ETH_RX_OFFLOAD_##_name, #_name }
> +
>  static const struct {
>      uint64_t offload;
>      const char *name;
> @@ -128,9 +131,11 @@ struct rte_eth_xstats_name_off {
>      RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
>      RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>      RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> +    RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
>  };
>
>  #undef RTE_RX_OFFLOAD_BIT2STR
> +#undef RTE_ETH_RX_OFFLOAD_BIT2STR
>
>  #define RTE_TX_OFFLOAD_BIT2STR(_name) \
>      { DEV_TX_OFFLOAD_##_name, #_name }
> @@ -1770,10 +1775,14 @@ struct rte_eth_dev *
>             struct rte_mempool *mp)
>  {
>      int ret;
> +    uint16_t seg_idx;
>      uint32_t mbp_buf_size;
>      struct rte_eth_dev *dev;
>      struct rte_eth_dev_info dev_info;
>      struct rte_eth_rxconf local_conf;
> +    const struct rte_eth_rxseg *rx_seg;
> +    struct rte_eth_rxseg seg_single = { .mp = mp};
> +    uint16_t n_seg;
>      void **rxq;
>
>      RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> @@ -1784,13 +1793,32 @@ struct rte_eth_dev *
>          return -EINVAL;
>      }
>
> -    if (mp == NULL) {
> -        RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> -        return -EINVAL;
> -    }
> -
>      RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
>
> +    rx_seg = rx_conf->rx_seg;
> +    n_seg = rx_conf->rx_nseg;
> +    if (rx_seg == NULL) {
> +        /* Exclude ambiguities about segment descrtiptions. */
> +        if (n_seg) {
> +            RTE_ETHDEV_LOG(ERR,
> +                       "Non empty array with null pointer\n");
> +            return -EINVAL;
> +        }
> +        rx_seg = &seg_single;
> +        n_seg = 1;

Why set 'rx_seg' & 'n_seg' here? Why not leave them NULL and 0 when not used?
This way the PMD can do the NULL/0 check and can know they are not used.

I think it is better to do an "if (mp == NULL)" check here, both 'rx_seg' &
'mp' should not be NULL. (A rough sketch of what I mean is after this hunk.)

> +    } else {
> +        if (n_seg == 0) {
> +            RTE_ETHDEV_LOG(ERR,
> +                       "Invalid zero descriptions number\n");
> +            return -EINVAL;
> +        }
> +        if (mp) {
> +            RTE_ETHDEV_LOG(ERR,
> +                       "Memory pool duplicated definition\n");
> +            return -EINVAL;
> +        }
> +    }
> +
>      /*
>       * Check the size of the mbuf data buffer.
>       * This value must be provided in the private data of the memory pool.
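To put the earlier comment about 'rx_seg' & 'n_seg' into code, below is the
kind of check I was thinking of for this part of rte_eth_rx_queue_setup()
(rough, untested fragment, reusing the variables from this patch):

    rx_seg = rx_conf->rx_seg;
    n_seg = rx_conf->rx_nseg;
    if (rx_seg == NULL) {
        if (n_seg != 0) {
            RTE_ETHDEV_LOG(ERR, "Non empty array with null pointer\n");
            return -EINVAL;
        }
        if (mp == NULL) {
            RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
            return -EINVAL;
        }
        /* leave rx_seg == NULL and n_seg == 0, so the PMD can see
         * that the buffer split descriptions are not used */
    } else {
        if (n_seg == 0) {
            RTE_ETHDEV_LOG(ERR, "Invalid zero descriptions number\n");
            return -EINVAL;
        }
        if (mp != NULL) {
            RTE_ETHDEV_LOG(ERR, "Memory pool duplicated definition\n");
            return -EINVAL;
        }
    }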
> @@ -1800,23 +1828,48 @@ struct rte_eth_dev *
>      if (ret != 0)
>          return ret;
>
> -    if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
> -        RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> -            mp->name, (int)mp->private_data_size,
> -            (int)sizeof(struct rte_pktmbuf_pool_private));
> -        return -ENOSPC;
> -    }
> -    mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +    for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
> +        struct rte_mempool *mpl = rx_seg[seg_idx].mp;
> +        uint32_t length = rx_seg[seg_idx].length;
> +        uint32_t offset = rx_seg[seg_idx].offset;
> +        uint32_t head_room = seg_idx ? 0 : RTE_PKTMBUF_HEADROOM;
>
> -    if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
> -        RTE_ETHDEV_LOG(ERR,
> -            "%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
> -            mp->name, (int)mbp_buf_size,
> -            (int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
> -            (int)RTE_PKTMBUF_HEADROOM,
> -            (int)dev_info.min_rx_bufsize);
> -        return -EINVAL;
> +        if (mpl == NULL) {
> +            RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> +            return -EINVAL;
> +        }
> +
> +        if (mpl->private_data_size <
> +            sizeof(struct rte_pktmbuf_pool_private)) {
> +            RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> +                mpl->name, (int)mpl->private_data_size,
> +                (int)sizeof(struct rte_pktmbuf_pool_private));
> +            return -ENOSPC;
> +        }
> +
> +        mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> +        length = length ? length : (mbp_buf_size - head_room);
> +        if (mbp_buf_size < length + offset + head_room) {
> +            RTE_ETHDEV_LOG(ERR,
> +                "%s mbuf_data_room_size %u < %u"
> +                " (segment length=%u + segment offset=%u)\n",
> +                mpl->name, mbp_buf_size,
> +                length + offset, length, offset);
> +            return -EINVAL;
> +        }
>      }
> +    /* Check the minimal buffer size for the single segment only. */

This is the main branch; what do you think about moving this comment to the
beginning of the for loop above and adding a comment there about testing the
multiple segments?

Btw, I have a concern that this single/multi segment naming can cause
confusion with multi-segment packets. Can something else, like "split
package", be used instead of segment?

> +    if (mp && (mbp_buf_size < dev_info.min_rx_bufsize +
> +           RTE_PKTMBUF_HEADROOM)) {
> +        RTE_ETHDEV_LOG(ERR,
> +            "%s mbuf_data_room_size %u < %u "
> +            "(RTE_PKTMBUF_HEADROOM=%u + "
> +            "min_rx_bufsize(dev)=%u)\n",
> +            mp->name, mbp_buf_size,
> +            RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize,
> +            RTE_PKTMBUF_HEADROOM, dev_info.min_rx_bufsize);
> +        return -EINVAL;
> +    }
>
>      /* Use default specified by driver, if nb_rx_desc is zero */
>      if (nb_rx_desc == 0) {
> @@ -1914,8 +1967,6 @@ struct rte_eth_dev *
>          dev->data->min_rx_buf_size = mbp_buf_size;
>      }
>
> -    rte_ethdev_trace_rxq_setup(port_id, rx_queue_id, nb_rx_desc, mp,
> -        rx_conf, ret);

Is this removed intentionally?

>      return eth_err(port_id, ret);
>  }
>
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 5bcfbb8..9cf0a03 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -970,6 +970,16 @@ struct rte_eth_txmode {
>  };
>
>  /**
> + * A structure used to configure an RX packet segment to split.
> + */
> +struct rte_eth_rxseg {
> +    struct rte_mempool *mp; /**< Memory pools to allocate segment from */
> +    uint16_t length; /**< Segment data length, configures split point. */
> +    uint16_t offset; /**< Data offset from beginning of mbuf data buffer */
> +    uint32_t reserved; /**< Reserved field */
> +};
> +
> +/**
>   * A structure used to configure an RX ring of an Ethernet port.
>   */
>  struct rte_eth_rxconf {
> @@ -977,13 +987,23 @@ struct rte_eth_rxconf {
>      uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
>      uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>      uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> +    uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +    /**
> +     * The pointer to the array of segment descriptions, each element
> +     * describes the memory pool, maximal segment data length, initial
> +     * data offset from the beginning of data buffer in mbuf. This allow
> +     * to specify the dedicated properties for each segment in the receiving
> +     * buffer - pool, buffer offset, maximal segment size. The number of
> +     * segment descriptions in the array is specified by the rx_nseg
> +     * field.
> +     */

What do you think about providing a short description here and moving the
comment above to above the "struct rte_eth_rxseg" struct?

> +    struct rte_eth_rxseg *rx_seg;
>      /**
>       * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>       * Only offloads set on rx_queue_offload_capa or rx_offload_capa
>       * fields on rte_eth_dev_info structure are allowed to be set.
>       */
>      uint64_t offloads;
> -

unrelated

>      uint64_t reserved_64s[2]; /**< Reserved for future fields */
>      void *reserved_ptrs[2]; /**< Reserved for future fields */
>  };
> @@ -1260,6 +1280,7 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_SCTP_CKSUM       0x00020000
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> +#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
>
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>                                   DEV_RX_OFFLOAD_UDP_CKSUM | \
> @@ -2027,6 +2048,41 @@ int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
>   *   No need to repeat any bit in rx_conf->offloads which has already been
>   *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
>   *   at port level can't be disabled at queue level.

Could it be possible to put a kind of marker here, like "@rx_seg & @rx_nseg",
to clarify what you are talking about?

> + *   The configuration structure also contains the pointer to the array
> + *   of the receiving buffer segment descriptions, each element describes
> + *   the memory pool, maximal segment data length, initial data offset from
> + *   the beginning of data buffer in mbuf. This allow to specify the dedicated
> + *   properties for each segment in the receiving buffer - pool, buffer
> + *   offset, maximal segment size. If RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload
> + *   flag is configured the PMD will split the received packets into multiple
> + *   segments according to the specification in the description array:
> + *   - the first network buffer will be allocated from the memory pool,
> + *     specified in the first segment description element, the second
> + *     network buffer - from the pool in the second segment description
> + *     element and so on. If there is no enough elements to describe
> + *     the buffer for entire packet of maximal length the pool from the last
> + *     valid element will be used to allocate the buffers from for the rest
> + *     of segments.
> + *   - the offsets from the segment description elements will provide the
> + *     data offset from the buffer beginning except the first mbuf - for this
> + *     one the offset is added to the RTE_PKTMBUF_HEADROOM to get actual
> + *     offset from the buffer beginning. If there is no enough elements
> + *     to describe the buffer for entire packet of maximal length the offsets
> + *     for the rest of segment will be supposed to be zero.
> + *   - the data length being received to each segment is limited by the
> + *     length specified in the segment description element. The data receiving
> + *     starts with filling up the first mbuf data buffer, if the specified
> + *     maximal segment length is reached and there are data remaining
> + *     (packet is longer than buffer in the first mbuf) the following data
> + *     will be pushed to the next segment up to its own length. If the first
> + *     two segments is not enough to store all the packet data the next
> + *     (third) segment will be engaged and so on. If the length in the segment
> + *     description element is zero the actual buffer size will be deduced
> + *     from the appropriate memory pool properties. If there is no enough
> + *     elements to describe the buffer for entire packet of maximal length
> + *     the buffer size will be deduced from the pool of the last valid
> + *     element for the all remaining segments.
> + *

I think as a first thing the comment should clarify that if @rx_seg is
provided, 'mb_pool' should be NULL, and if the split Rx feature is not used,
"@rx_seg & @rx_nseg" should be NULL and 0.

Also the above is too wordy, it is hard to follow. Things like "@rx_seg &
@rx_nseg are only taken into account if the application provides the
'RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT' offload" should be clearer to see, etc.
Can you try to simplify it? Perhaps moving some of the above comments to
"struct rte_eth_rxseg" can work.

>   * @param mb_pool
>   *   The pointer to the memory pool from which to allocate *rte_mbuf* network
>   *   memory buffers to populate each descriptor of the receive ring.
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index f8a0945..25f7cee 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -232,6 +232,7 @@ EXPERIMENTAL {
>      rte_eth_fec_get_capability;
>      rte_eth_fec_get;
>      rte_eth_fec_set;
> +

unrelated
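And to summarize the contract I would expect to see documented for the
rte_eth_rx_queue_setup() parameters, something as short as the below may be
enough (just a wording suggestion, feel free to rephrase):

    /*
     * @rx_seg & @rx_nseg are only taken into account when the
     * RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload is requested in rx_conf:
     *   - buffer split used:     mb_pool == NULL, rx_seg != NULL, rx_nseg > 0
     *   - buffer split not used: mb_pool != NULL, rx_seg == NULL, rx_nseg == 0
     */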