From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mga07.intel.com (mga07.intel.com [134.134.136.100])
    by dpdk.org (Postfix) with ESMTP id 037B8530F
    for ; Mon, 24 Oct 2016 18:25:42 +0200 (CEST)
Received: from orsmga003.jf.intel.com ([10.7.209.27])
    by orsmga105.jf.intel.com with ESMTP; 24 Oct 2016 09:25:41 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.31,542,1473145200";
    d="scan'208";a="893380104"
Received: from bricha3-mobl3.ger.corp.intel.com ([10.237.210.150])
    by orsmga003.jf.intel.com with SMTP; 24 Oct 2016 09:25:39 -0700
Received: by (sSMTP sendmail emulation); Mon, 24 Oct 2016 17:25:38 +0100
Date: Mon, 24 Oct 2016 17:25:38 +0100
From: Bruce Richardson
To: "Wiles, Keith"
Cc: Morten Brørup, "dev@dpdk.org", Olivier Matz
Message-ID: <20161024162538.GA34988@bricha3-MOBL3.ger.corp.intel.com>
References: <98CBD80474FA8B44BF855DF32C47DC359EA8B1@smartserver.smartshare.dk>
    <7910CF2F-7087-4307-A9AC-DE0287104185@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <7910CF2F-7087-4307-A9AC-DE0287104185@intel.com>
Organization: Intel Research and Development Ireland Ltd.
User-Agent: Mutt/1.7.1 (2016-10-04)
Subject: Re: [dpdk-dev] mbuf changes
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK
X-List-Received-Date: Mon, 24 Oct 2016 16:25:43 -0000

On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
>
> > On Oct 24, 2016, at 10:49 AM, Morten Brørup wrote:
> >
> > First of all: Thanks for a great DPDK Userspace 2016!
> >
> > Continuing the Userspace discussion about Olivier Matz’s proposed mbuf changes...

Thanks for keeping the discussion going!

> >
> > 1.
> >
> > Stephen Hemminger had a noteworthy general comment about keeping metadata for the NIC in the appropriate section of the mbuf: Metadata generated by the NIC’s RX handler belongs in the first cache line, and metadata required by the NIC’s TX handler belongs in the second cache line. This also means that touching the second cache line on ingress should be avoided if possible; and Bruce Richardson mentioned that for this reason m->next was zeroed on free().
> >

Thinking about it, I suspect there are more fields we can reset on free to save time on alloc. Refcnt, as discussed below is one of them, but so too could be the nb_segs field and possibly others.

> >
> > 2.
> >
> > There seemed to be consensus that the size of m->refcnt should match the size of m->port because a packet could be duplicated on all physical ports for L3 multicast and L2 flooding.
> >
> > Furthermore, although a single physical machine (i.e. a single server) with 255 physical ports probably doesn’t exist, it might contain more than 255 virtual machines with a virtual port each, so it makes sense extending these mbuf fields from 8 to 16 bits.
>
> I thought we also talked about removing the m->port from the mbuf as it is not really needed.
>

Yes, this was mentioned, and also the option of moving the port value to the second cacheline, but it appears that NXP are using the port value in their NIC drivers for passing in metadata, so we'd need their agreement on any move (or removal).

> >
> > 3.
> >
> > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to the second cache line, which then generated questions from the audience about the real life purpose of m->port, and if m->port could be removed from the mbuf structure.
> >
> > 4.
> >
> > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on first allocation.
> > This is based on the assumption that other mbuf fields must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper than setting it to 1.
> >
> > Furthermore (regardless of m->refcnt offset), I suggested that it is not required to modify m->refcnt when allocating and freeing the mbuf, thus saving one write operation on both alloc() and free(). However, this assumes that m->refcnt debugging, e.g. underrun detection, is not required.

I don't think it really matters what sentinel value is used for the refcnt because it can't be blindly assigned on free like other fields. However, I think 0 as first reference value becomes more awkward than 1, because we need to deal with underflow. Consider the situation where we have two references to the mbuf, so refcnt is 1, and both are freed at the same time. Since the refcnt is non-zero, both cores will do an atomic decrement simultaneously, giving a refcnt of -1. We can then set this back to zero before freeing; however, I'd still prefer to have refcnt be an accurate value so that it always stays positive, and we can still set it to "one" on free to avoid having to set on alloc.

Also, if we set refcnt on free rather than alloc, it does set itself up as a good candidate for moving to the second cacheline. Fast-path processing does not normally update the value.

> >
> > 5.
> >
> > And here’s something new to think about:
> >
> > m->next already reveals if there are more segments to a packet. Which purpose does m->nb_segs serve that is not already covered by m->next?

It is duplicate info, but nb_segs can be used to check the validity of the next pointer without having to read the second mbuf cacheline. Whether it's worth having is something I'm happy enough to discuss, though.

One other point I'll mention is that we need to have a discussion on how/where to add in a timestamp value into the mbuf.
Personally, I think it can be in a union with the sequence number value, but I also suspect that 32-bits of a timestamp is not going to be enough for many.

Thoughts?

/Bruce
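
The "set refcnt on free rather than alloc" idea from point 4 can be sketched in C11 roughly as below. This is only a minimal sketch under stated assumptions, not DPDK code: struct sketch_mbuf and the sketch_* helpers are hypothetical names made up for illustration, the mempool itself is left out, and refcnt uses 1 (not 0) as the single-owner value so that it never goes negative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down buffer header; NOT the real struct rte_mbuf. */
struct sketch_mbuf {
    _Atomic uint16_t refcnt;   /* left at 1 while the buffer sits in the pool */
    uint8_t nb_segs;           /* left at 1 while the buffer sits in the pool */
    struct sketch_mbuf *next;  /* left at NULL while the buffer sits in the pool */
};

/* Alloc has nothing to initialise: free already left the fields ready. */
static inline struct sketch_mbuf *sketch_alloc(struct sketch_mbuf *from_pool)
{
    return from_pool;
}

/* Take an extra reference, e.g. when duplicating a packet to another port. */
static inline void sketch_ref(struct sketch_mbuf *m)
{
    atomic_fetch_add(&m->refcnt, 1);
}

/*
 * Drop one reference. Returns 1 if the caller held the last reference and
 * the buffer may go back to the pool, 0 if other owners remain.
 */
static inline int sketch_free(struct sketch_mbuf *m)
{
    /* Common case: sole owner, so refcnt is already 1 and is not written. */
    if (atomic_load(&m->refcnt) != 1) {
        uint16_t old = atomic_fetch_sub(&m->refcnt, 1);
        assert(old != 0);      /* sanity check: counter must never pass zero */
        if (old != 1)
            return 0;          /* someone else still holds a reference */
        /* We dropped the last reference: restore the "in pool" value. */
        atomic_store(&m->refcnt, 1);
    }
    /* Reset the other fields on free so that alloc can skip them. */
    m->next = NULL;
    m->nb_segs = 1;
    return 1;
}
```

When two owners free at the same time, both take the atomic decrement path, exactly one of them sees the old value 1, and only that one resets the buffer. The trade-off is the one raised in the thread: because a pooled buffer keeps refcnt at 1, a double free by a single owner is not caught by this sketch.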