From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jerin Jacob
Date: Mon, 8 Nov 2021 22:17:59 +0530
To: Morten Brørup
Cc: Honnappa Nagarahalli, Dharmik Thakkar, "Ananyev, Konstantin",
 Olivier Matz, Andrew Rybchenko, dpdk-dev, nd, Ruifeng Wang
Subject: Re: [dpdk-dev] [RFC] mempool: implement index-based per core cache
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D86CB4@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

On Mon, Nov 8, 2021 at 9:34 PM Morten Brørup wrote:
>
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Honnappa
> > Nagarahalli
> > Sent: Monday, 8 November 2021 16.46
> >
> > >>>>>>>>> Current mempool per core cache implementation is based on
> > >>>>>>>>> pointers. For most architectures, each pointer consumes 64b.
> > >>>>>>>>> Replace it with an
> > >>>>>>>>> index-based implementation, wherein each buffer is addressed
> > >>>>>>>>> by (pool address + index).
> > >>>>
> > >>>> I like Dharmik's suggestion very much. CPU cache is a critical
> > >>>> and limited resource.
> > >>>>
> > >>>> DPDK has a tendency of using pointers where indexes could be used
> > >>>> instead. I suppose pointers provide the additional flexibility of
> > >>>> mixing entries from different memory pools, e.g. multiple mbuf
> > >>>> pools.
> > >>
> > >> Agreed, thank you!
> > >>
> > >>>>>>>> I don't think it is going to work:
> > >>>>>>>> On 64-bit systems, the difference between the pool address
> > >>>>>>>> and its elem address could be bigger than 4 GB.
> > >>>>>>> Are you talking about a case where the memory pool size is
> > >>>>>>> more than 4 GB?
> > >>>>>>
> > >>>>>> That is one possible scenario.
> > >>>>
> > >>>> That could be solved by making the index an element index instead
> > >>>> of a pointer offset: address = (pool address + index * element
> > >>>> size).
> > >>> Or instead of scaling the index with the element size, which is
> > >>> only known at runtime, the index could be more efficiently scaled
> > >>> by a compile time constant such as RTE_MEMPOOL_ALIGN
> > >>> (= RTE_CACHE_LINE_SIZE). With a cache line size of 64 bytes, that
> > >>> would allow indexing into mempools up to 256 GB in size.
> > >>
> > >> Looking at this snippet [1] from rte_mempool_op_populate_helper(),
> > >> there is an 'offset' added to avoid objects crossing page
> > >> boundaries. If my understanding is correct, using the index of the
> > >> element instead of a pointer offset will pose a challenge for some
> > >> of the corner cases.
> > >>
> > >> [1]
> > >> for (i = 0; i < max_objs; i++) {
> > >>         /* avoid objects to cross page boundaries */
> > >>         if (check_obj_bounds(va + off, pg_sz, total_elt_sz) < 0) {
> > >>                 off += RTE_PTR_ALIGN_CEIL(va + off, pg_sz) -
> > >>                         (va + off);
> > >>                 if (flags & RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ)
> > >>                         off += total_elt_sz -
> > >>                                 (((uintptr_t)(va + off - 1) %
> > >>                                 total_elt_sz) + 1);
> > >>         }
> > >
> > > > > > > > > OK.
Alternatively to scaling the index with the
> > > > > > > > > cache line size, you can scale it with sizeof(uintptr_t)
> > > > > > > > > to be able to address 32 GB or 16 GB mempools on 64 bit
> > > > > > > > > and 32 bit architectures respectively. Both x86 and ARM
> > > > > > > > > CPUs have instructions to access memory with an added
> > > > > > > > > offset multiplied by 4 or 8. So that should be high
> > > > > > > > > performance.
> > > > > > > >
> > > > > > > > Yes, agreed this can be done.
> > > > > > > > Cache line size can also be used when
> > > > > > > > 'MEMPOOL_F_NO_CACHE_ALIGN' is not enabled.
> > > > > > > > On a side note, I wanted to better understand the need for
> > > > > > > > having the 'MEMPOOL_F_NO_CACHE_ALIGN' option.
> > > > > > >
> > > > > > > The description of this field is misleading, and should be
> > > > > > > corrected. The correct description would be: Don't need to
> > > > > > > align objs on cache lines.
> > > > > > >
> > > > > > > It is useful for mempools containing very small objects, to
> > > > > > > conserve memory.
> > > > > >
> > > > > > I think we can assume that mbuf pools are created with the
> > > > > > 'MEMPOOL_F_NO_CACHE_ALIGN' flag set. With this we can use an
> > > > > > offset calculated with the cache line size as the unit.
> > > > >
> > > > > You mean with the MEMPOOL_F_NO_CACHE_ALIGN flag not set. ;-)
> > > > Yes 😊
> > > >
> > > > > I agree. And since the flag is a hint only, it can be ignored if
> > > > > the mempool library is scaling the index with the cache line
> > > > > size.
> > > > I do not think we should ignore the flag, for the reason you
> > > > mention below.
> > > > > However, a mempool may contain other objects than mbufs, and
> > > > > those objects may be small, so ignoring the
> > > > > MEMPOOL_F_NO_CACHE_ALIGN flag may cost a lot of memory for such
> > > > > mempools.
> > > > We could use different methods. If MEMPOOL_F_NO_CACHE_ALIGN is
> > > > set, use sizeof(uintptr_t) as the unit; if not set, use the cache
> > > > line size as the unit.
> > >
> > > That would require that the indexing multiplier is a runtime
> > > parameter instead of a compile time parameter. So it would have a
> > > performance penalty.
> > >
> > > The indexing multiplier could be compile time configurable, so it
> > > is a tradeoff between granularity and maximum mempool size.
> > I meant compile time configurable, i.e.:
> >
> > #ifdef MEMPOOL_F_NO_CACHE_ALIGN
> > <use sizeof(uintptr_t) as the unit>
> > #else
> > <use the cache line size as the unit> /* This should provide enough
> > memory for packet buffers */
> > #endif
>
> Please note that MEMPOOL_F_NO_CACHE_ALIGN is a runtime flag passed
> when creating a mempool, not a compile time option.

Also, please share PMU counter stats on L1 and L2 misses with and
without this scheme after the rework. IMO, we should not have any
regression on
1) per core mpps, or
2) L1 and L2 misses,
with l3fwd/testpmd/l2fwd etc.