From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dispatch1-us1.ppe-hosted.com (dispatch1-us1.ppe-hosted.com [148.163.129.52]) by dpdk.org (Postfix) with ESMTP id DDE731BBBA for ; Wed, 11 Apr 2018 16:03:02 +0200 (CEST) X-Virus-Scanned: Proofpoint Essentials engine Received: from webmail.solarflare.com (uk.solarflare.com [193.34.186.16]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1-us4.ppe-hosted.com (Proofpoint Essentials ESMTP Server) with ESMTPS id AFBB1BC0087; Wed, 11 Apr 2018 14:03:00 +0000 (UTC) Received: from [192.168.38.17] (84.52.114.114) by ukex01.SolarFlarecom.com (10.17.10.4) with Microsoft SMTP Server (TLS) id 15.0.1044.25; Wed, 11 Apr 2018 15:02:53 +0100 To: "Ananyev, Konstantin" , Yongseok Koh CC: Olivier Matz , "Lu, Wenzhuo" , "Wu, Jingjing" , "Adrien Mazarguil" , =?UTF-8?Q?N=c3=a9lio_Laranjeiro?= , "dev@dpdk.org" References: <20180310012532.15809-1-yskoh@mellanox.com> <20180402185008.13073-1-yskoh@mellanox.com> <20180402185008.13073-2-yskoh@mellanox.com> <20180403082615.etnr33cuyey7i3u3@platinum> <20180404001205.GB1867@yongseok-MBP.local> <20180409160434.kmw4iyztemrkzmtc@platinum> <20180410015902.GA20627@yongseok-MBP.local> <2601191342CEEE43887BDE71AB977258AE91344A@IRSMSX102.ger.corp.intel.com> <20180411053302.GA26252@yongseok-MBP.local> <2601191342CEEE43887BDE71AB977258AE913944@IRSMSX102.ger.corp.intel.com> From: Andrew Rybchenko Message-ID: <44a32fef-9087-f0c1-b28f-47b2d6a06995@solarflare.com> Date: Wed, 11 Apr 2018 17:02:50 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0 MIME-Version: 1.0 In-Reply-To: <2601191342CEEE43887BDE71AB977258AE913944@IRSMSX102.ger.corp.intel.com> Content-Language: en-GB X-Originating-IP: [84.52.114.114] X-ClientProxiedBy: ocex03.SolarFlarecom.com (10.20.40.36) To ukex01.SolarFlarecom.com (10.17.10.4) X-TM-AS-Product-Ver: SMEX-11.0.0.1191-8.100.1062-23776.003 X-TM-AS-Result: No--28.142400-0.000000-31 X-TM-AS-User-Approved-Sender: Yes X-TM-AS-User-Blocked-Sender: No X-MDID: 1523455381-xWq7ELTiaYdq Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.15 Subject: Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 11 Apr 2018 14:03:03 -0000 On 04/11/2018 02:39 PM, Ananyev, Konstantin wrote: > Hi Yongseok, > >>>> On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote: >>>>> Hi Yongseok, >>>>> >>>>> On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote: >>>>>> On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote: >>>>>>> Hi, >>>>>>> >>>>>>> On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote: >>>>>>>> When attaching a mbuf, indirect mbuf has to point to start of buffer of >>>>>>>> direct mbuf. By adding buf_off field to rte_mbuf, this becomes more >>>>>>>> flexible. Indirect mbuf can point to any part of direct mbuf by calling >>>>>>>> rte_pktmbuf_attach_at(). >>>>>>>> >>>>>>>> Possible use-cases could be: >>>>>>>> - If a packet has multiple layers of encapsulation, multiple indirect >>>>>>>> buffers can reference different layers of the encapsulated packet. >>>>>>>> - A large direct mbuf can even contain multiple packets in series and >>>>>>>> each packet can be referenced by multiple mbuf indirections. >>>>>>>> >>>>>>>> Signed-off-by: Yongseok Koh >>>>>>> I think the current API is already able to do what you want. >>>>>>> >>>>>>> 1/ Here is a mbuf m with its data >>>>>>> >>>>>>> off >>>>>>> <--> >>>>>>> len >>>>>>> +----+ <----------> >>>>>>> | | >>>>>>> +-|----v----------------------+ >>>>>>> | | -----------------------| >>>>>>> m | buf | XXXXXXXXXXX || >>>>>>> | -----------------------| >>>>>>> +-----------------------------+ >>>>>>> >>>>>>> >>>>>>> 2/ clone m: >>>>>>> >>>>>>> c = rte_pktmbuf_alloc(pool); >>>>>>> rte_pktmbuf_attach(c, m); >>>>>>> >>>>>>> Note that c has its own offset and length fields. >>>>>>> >>>>>>> >>>>>>> off >>>>>>> <--> >>>>>>> len >>>>>>> +----+ <----------> >>>>>>> | | >>>>>>> +-|----v----------------------+ >>>>>>> | | -----------------------| >>>>>>> m | buf | XXXXXXXXXXX || >>>>>>> | -----------------------| >>>>>>> +------^----------------------+ >>>>>>> | >>>>>>> +----+ >>>>>>> indirect | >>>>>>> +-|---------------------------+ >>>>>>> | | -----------------------| >>>>>>> c | buf | || >>>>>>> | -----------------------| >>>>>>> +-----------------------------+ >>>>>>> >>>>>>> off len >>>>>>> <--><----------> >>>>>>> >>>>>>> >>>>>>> 3/ remove some data from c without changing m >>>>>>> >>>>>>> rte_pktmbuf_adj(c, 10) // at head >>>>>>> rte_pktmbuf_trim(c, 10) // at tail >>>>>>> >>>>>>> >>>>>>> Please let me know if it fits your needs. >>>>>> No, it doesn't. >>>>>> >>>>>> Trimming head and tail with the current APIs removes data and make the space >>>>>> available. Adjusting packet head means giving more headroom, not shifting the >>>>>> buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to >>>>>> difference offsets in m, >>>>>> >>>>>> rte_pktmbuf_adj(c1, 10); >>>>>> rte_pktmbuf_adj(c2, 20); >>>>>> >>>>>> then the owner of c2 regard the first (off+20)B as available headroom. If it >>>>>> wants to attach outer header, it will overwrite the headroom even though the >>>>>> owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer >>>>>> header should be linked by h1->next = c2. >>>>> Yes, after these operations c1, c2 and m should become read-only. So, to >>>>> prepend headers, another mbuf has to be inserted before as you suggest. It >>>>> is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset, >>>>> length) that will: >>>>> - alloc and attach indirect mbuf for each segment of m that is >>>>> in the range [offset : length+offset]. >>>>> - prepend an empty and writable mbuf for the headers >>>>> >>>>>> If c1 and c2 are attached with shifting buffer address by adjusting buf_off, >>>>>> which actually shrink the headroom, this case can be properly handled. >>>>> What do you mean by properly handled? >>>>> >>>>> Yes, prepending data or adding data in the indirect mbuf won't override >>>>> the direct mbuf. But prepending data or adding data in the direct mbuf m >>>>> won't be protected. >>>>> >>>>> From an application point of view, indirect mbufs, or direct mbufs that >>>>> have refcnt != 1, should be both considered as read-only because they >>>>> may share their data. How an application can know if the data is shared >>>>> or not? >>>>> >>>>> Maybe we need a flag to differentiate mbufs that are read-only >>>>> (something like SHARED_DATA, or simply READONLY). In your case, if my >>>>> understanding is correct, you want to have indirect mbufs with RW data. >>>> Agree that indirect mbuf must be treated as read-only, Then the current code is >>>> enough to handle that use-case. >>>> >>>>>> And another use-case (this is my actual use-case) is to make a large mbuf have >>>>>> multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs >>>>>> because it transfers multiple packets to a single large buffer to reduce PCIe >>>>>> overhead for small packet traffic like the Multi-Packet Rx of mlx5 does. >>>>>> Otherwise, packets should be memcpy'd to regular mbufs one by one instead of >>>>>> indirect referencing. >>> But just to make HW to RX multiple packets into one mbuf, >>> data_off inside indirect mbuf should be enough, correct? >> Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5 >> to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in >> a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be >> better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order >> to saturate the network link. > There were few complains that 64KB max is a limitation for some use-cases. > I am not against increasing it, but I don't think we have free space on first cache-line for that > without another big rework of mbuf layout. > Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too. > >>> As I understand, what you'd like to achieve with this new field - >>> ability to manipulate packet boundaries after RX, probably at upper layer. >>> As Olivier pointed above, that doesn't sound as safe approach - as you have multiple >>> indirect mbufs trying to modify same direct buffer. >> I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1 >> is read-only. What that means, all the entities which own such mbufs have to be >> aware of that and keep the principle as DPDK can't enforce the rule and there >> can't be such sanity check. In this sense, HW doesn't violate it because the >> direct mbuf is injected to HW before indirection. When packets are written by >> HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to >> application layer with freeing the original direct mbuf (decrement refcnt by 1). >> So, HW doesn't touch the direct buffer once it reaches to upper layer. > Yes, I understand that. But as I can see you introduced functions to adjust head and tail, > which implies that it should be possible by some entity (upper layer?) to manipulate these > indirect mbufs. > And we don't know how exactly it will be done. > >> The direct buffer will be freed and get available for reuse when all the attached >> indirect mbufs are freed. >> >>> Though if you really need to do that, why it can be achieved by updating buf_len and priv_size >>> Fields for indirect mbufs, straight after attach()? >> Good point. >> Actually that was my draft (Mellanox internal) version of this patch :-) But I >> had to consider a case where priv_size is really given by user. Even though it >> is less likely, but if original priv_size is quite big, it can't cover entire >> buf_len. For this, I had to increase priv_size to 32-bit but adding another >> 16bit field (buf_off) looked more plausible. > As I remember, we can't have mbufs bigger then 64K, > so priv_size + buf_len should be always less than 64K, correct? It sounds like it is suggested to use/customize priv_size to limit indirect mbuf range in the direct one. It does not work from the box since priv_size is used to find out direct mbuf by indirect (see rte_mbuf_from_indirect()). Andrew.