Date: Fri, 14 Jan 2022 09:11:07 +0000
From: Bruce Richardson
To: Morten Brørup
Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Konstantin Ananyev, dev@dpdk.org
Subject: Re: rte_memcpy alignment

On Fri, Jan 14, 2022 at 09:56:50AM +0100, Morten Brørup wrote:
> Dear ARM/POWER/x86 maintainers,
>
> The architecture-specific rte_memcpy() provides optimized variants to copy aligned data. However, the alignment requirements depend on the hardware architecture, and there is no common definition of the alignment.
>
> DPDK provides __rte_cache_aligned for cache-optimization purposes, with architecture-specific values. Would you consider providing an __rte_memcpy_aligned for rte_memcpy() optimization purposes?
>
> Or should I just use __rte_cache_aligned, although it is overkill?
>
> Specifically, I am working on a mempool optimization where the objs field in the rte_mempool_cache structure may benefit from being aligned for optimized rte_memcpy().
>
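To make the request concrete, the sketch below shows how I read the ask. Note that only __rte_cache_aligned (and the generic __rte_aligned(n)) exist in DPDK today; the __rte_memcpy_aligned macro and its 16-byte value here are purely hypothetical placeholders:

	#include <rte_common.h>	/* __rte_cache_aligned, __rte_aligned() */

	/* What can be done today: cache-line alignment, which may be
	 * stricter (e.g. 64 or 128 bytes) than what the architecture's
	 * rte_memcpy() fast paths actually require. */
	struct cache_today {
		void *objs[512] __rte_cache_aligned;
	};

	/* The hypothetical ask: a per-architecture alignment qualifier
	 * matched to rte_memcpy(). The name and the 16-byte value below
	 * are invented placeholders, not anything defined in DPDK. */
	#define __rte_memcpy_aligned __rte_aligned(16)

	struct cache_proposed {
		void *objs[512] __rte_memcpy_aligned;
	};
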
For me, the difficulty with such a memcpy proposal - apart from probably adding to the amount of memcpy code we have to maintain - is the specific meaning of "aligned" in the memcpy case. Unlike for a struct definition, "aligned" for a memcpy could mean any of the following:

* the source address is aligned
* the destination address is aligned
* both source and destination are aligned
* both source and destination are aligned, and the copy length is a multiple of the alignment length
* the data is aligned to a cache-line boundary
* the data is aligned to the largest load/store size for the system
* the data is aligned to the boundary suitable for the copy size, e.g. a memcpy of 8 bytes is 8-byte aligned, etc.

Can you clarify your own thinking here a bit more?

Personally, I am a little dubious of the benefit of a general memcpy optimization, but I do believe there is value in specific use cases having their own copy operations which encode the constraints of those use cases. For example, in the AVX-512 ice/i40e PMD code, we fold the memcpy from the mempool cache into the descriptor-rearm function, because we know we can always do 64-byte loads and stores, and also because we know that, for each load in the copy, we can reuse the data just after storing it (giving a good performance boost); a rough sketch of the idea is in the P.S. below. Perhaps something similar could work for you in your mempool optimization.

/Bruce
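
P.S. Here is a rough scalar sketch of the fused copy-plus-rearm pattern described above. It is illustrative only - the function, the names and the one-word descriptor layout are simplified stand-ins, not the actual ice/i40e driver code, which does this eight pointers (64 bytes) at a time with aligned AVX-512 loads and stores:

	#include <stdint.h>
	#include <rte_mbuf.h>

	static inline void
	rearm_from_cache(void * const *cache_objs, struct rte_mbuf **sw_ring,
			volatile uint64_t *rx_descs, unsigned int n)
	{
		unsigned int i;

		for (i = 0; i < n; i++) {
			struct rte_mbuf *mb = cache_objs[i];

			/* The "copy" from the mempool cache: rather than a
			 * separate rte_memcpy() into sw_ring followed by a
			 * second pass over the mbufs, store the pointer... */
			sw_ring[i] = mb;

			/* ...and reuse it immediately, while it is still in
			 * a register, to write the buffer's DMA address into
			 * the descriptor ring. */
			rx_descs[i] = rte_mbuf_data_iova_default(mb);
		}
	}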