From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Ananyev, Konstantin"
To: Yongseok Koh
CC: Olivier Matz, "Lu, Wenzhuo", "Wu, Jingjing", Adrien Mazarguil,
    Nélio Laranjeiro, dev@dpdk.org
Date: Wed, 11 Apr 2018 11:39:47 +0000
Message-ID: <2601191342CEEE43887BDE71AB977258AE913944@IRSMSX102.ger.corp.intel.com>
In-Reply-To: <20180411053302.GA26252@yongseok-MBP.local>
Subject: Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for
    flexible indirection

Hi Yongseok,

> > >
> > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > Hi Yongseok,
> > > >
> > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > When attaching a mbuf, an indirect mbuf has to point to the start of the
> > > > > > > buffer of the direct mbuf. By adding a buf_off field to rte_mbuf, this
> > > > > > > becomes more flexible. An indirect mbuf can point to any part of a direct
> > > > > > > mbuf by calling rte_pktmbuf_attach_at().
> > > > > > >
> > > > > > > Possible use-cases could be:
> > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > >
> > > > > > > Signed-off-by: Yongseok Koh
> > > > > >
> > > > > > I think the current API is already able to do what you want.
> > > > > >
> > > > > > 1/ Here is a mbuf m with its data
> > > > > >
> > > > > >                off
> > > > > >               <-->
> > > > > >                       len
> > > > > >           +----+   <---------->
> > > > > >           |    |
> > > > > >         +-|----v----------------------+
> > > > > >         | |    -----------------------|
> > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > >         |      -----------------------|
> > > > > >         +-----------------------------+
> > > > > >
> > > > > >
> > > > > > 2/ clone m:
> > > > > >
> > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > >   rte_pktmbuf_attach(c, m);
> > > > > >
> > > > > >   Note that c has its own offset and length fields.
> > > > > >
> > > > > >
> > > > > >                off
> > > > > >               <-->
> > > > > >                       len
> > > > > >           +----+   <---------->
> > > > > >           |    |
> > > > > >         +-|----v----------------------+
> > > > > >         | |    -----------------------|
> > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > >         |      -----------------------|
> > > > > >         +------^----------------------+
> > > > > >                |
> > > > > >           +----+
> > > > > > indirect  |
> > > > > >         +-|---------------------------+
> > > > > >         | |    -----------------------|
> > > > > > c       | buf  |                     ||
> > > > > >         |      -----------------------|
> > > > > >         +-----------------------------+
> > > > > >
> > > > > >                 off    len
> > > > > >                 <--><---------->
> > > > > >
> > > > > >
> > > > > > 3/ remove some data from c without changing m
> > > > > >
> > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > >
> > > > > >
> > > > > > Please let me know if it fits your needs.
> > > > >
> > > > > No, it doesn't.
> > > > >
> > > > > Trimming head and tail with the current APIs removes data and makes the space
> > > > > available. Adjusting the packet head means giving more headroom, not shifting
> > > > > the buffer itself. If m has two indirect mbufs (c1 and c2) and those point to
> > > > > different offsets in m,
> > > > >
> > > > > rte_pktmbuf_adj(c1, 10);
> > > > > rte_pktmbuf_adj(c2, 20);
> > > > >
> > > > > then the owner of c2 regards the first (off+20)B as available headroom. If it
> > > > > wants to attach an outer header, it will overwrite the headroom even though the
> > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > header should be linked by h1->next = c2.
> > > >
> > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > prepend headers, another mbuf has to be inserted before, as you suggest. It
> > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > length) that will:
> > > > - alloc and attach an indirect mbuf for each segment of m that is
> > > >   in the range [offset : length+offset].
> > > > - prepend an empty and writable mbuf for the headers
> > > >
> > > > > If c1 and c2 are attached with shifting the buffer address by adjusting buf_off,
> > > > > which actually shrinks the headroom, this case can be properly handled.
> > > >
> > > > What do you mean by properly handled?
> > > >
> > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > won't be protected.
> > > >
> > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > have refcnt != 1, should both be considered as read-only because they
> > > > may share their data. How can an application know if the data is shared
> > > > or not?
> > > >
> > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > understanding is correct, you want to have indirect mbufs with RW data.
> > >
> > > Agree that an indirect mbuf must be treated as read-only. Then the current code is
> > > enough to handle that use-case.
> > >
> > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > overhead for small packet traffic, like the Multi-Packet Rx of mlx5 does.
> > > > > Otherwise, packets would have to be memcpy'd to regular mbufs one by one instead
> > > > > of indirect referencing.
> >
> > But just to make HW RX multiple packets into one mbuf,
> > data_off inside the indirect mbuf should be enough, correct?
>
> Right. The current max buffer len of an mbuf is 64kB (16 bits) but it is enough for
> mlx5 to reach 100Gbps with 64B traffic (149Mpps). I made the mlx5 HW put 16 packets
> in a buffer, so it needs a ~32kB buffer. Having more bits in the length fields would
> be better, but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> to saturate the network link.

There were a few complaints that the 64KB max is a limitation for some use-cases.
I am not against increasing it, but I don't think we have free space on the first
cache-line for that without another big rework of the mbuf layout, considering that
we would need to increase the size of buf_len, data_off, data_len, and probably
priv_size too.

>
> > As I understand, what you'd like to achieve with this new field is the
> > ability to manipulate packet boundaries after RX, probably at the upper layer.
> > As Olivier pointed out above, that doesn't sound like a safe approach - as you have
> > multiple indirect mbufs trying to modify the same direct buffer.
>
> I agree that there's an implication that an indirect mbuf, or an mbuf having
> refcnt > 1, is read-only. What that means is that all the entities which own such
> mbufs have to be aware of that and keep to the principle, as DPDK can't enforce the
> rule and there can't be such a sanity check. In this sense, HW doesn't violate it
> because the direct mbuf is injected to HW before indirection. When packets are
> written by HW, the PMD attaches indirect mbufs to the direct mbuf and delivers
> those to the application layer, freeing the original direct mbuf (decrementing
> refcnt by 1). So, HW doesn't touch the direct buffer once it reaches the upper layer.

Yes, I understand that. But as far as I can see, you introduced functions to adjust
head and tail, which implies that some entity (the upper layer?) should be able to
manipulate these indirect mbufs. And we don't know how exactly that will be done.

> The direct buffer will be freed and become available for reuse when all the attached
> indirect mbufs are freed.
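Just to check that I picture that Rx path correctly, below is a rough sketch of
what I think the PMD side looks like with the proposed API. The exact
rte_pktmbuf_attach_at() signature (indirect mbuf, direct mbuf, byte offset into the
direct buffer) and the per-packet offset/length handling are my assumptions here,
not something taken from the patch:

    /*
     * Sketch only: rte_pktmbuf_attach_at() signature assumed; error
     * handling and refcnt corner cases omitted.
     */
    #include <rte_mbuf.h>

    static uint16_t
    deliver_multi_packet_buf(struct rte_mempool *mp, struct rte_mbuf *big,
                             const uint16_t *off, const uint16_t *len,
                             struct rte_mbuf **rx_pkts, uint16_t n)
    {
            uint16_t i;

            for (i = 0; i < n; i++) {
                    struct rte_mbuf *mi = rte_pktmbuf_alloc(mp);

                    if (mi == NULL)
                            break;
                    /* indirect mbuf referencing one packet inside 'big' */
                    rte_pktmbuf_attach_at(mi, big, off[i]);
                    mi->data_len = len[i];
                    mi->pkt_len = len[i];
                    rx_pkts[i] = mi;
            }
            /* drop the PMD's own reference; the buffer of 'big' can only be
             * reused after every attached indirect mbuf has been freed */
            rte_pktmbuf_free(big);
            return i;
    }

If that's roughly right, then after this point HW and the PMD are done with the big
buffer, and everything above only ever sees the indirect mbufs, which, as discussed,
have to be treated as read-only.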
>
> > Though if you really need to do that, couldn't it be achieved by updating the
> > buf_len and priv_size fields for indirect mbufs, straight after attach()?
>
> Good point.
> Actually that was my draft (Mellanox internal) version of this patch :-) But I
> had to consider a case where priv_size is really given by the user. Even though it
> is less likely, if the original priv_size is quite big, it can't cover the entire
> buf_len. For this, I would have had to increase priv_size to 32-bit, so adding
> another 16-bit field (buf_off) looked more plausible.

As I remember, we can't have mbufs bigger than 64K, so priv_size + buf_len should
always be less than 64K, correct?

Konstantin

> Thanks for good comments,
> Yongseok
>
> > > > >
> > > > > Does this make sense?
> > > >
> > > > I understand the need.
> > > >
> > > > Another option would be to make the mbuf->buffer point to an external
> > > > buffer (not inside the direct mbuf). This would require to add a
> > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > > a quick overview.
> > > >
> > > > [1] https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf
> > > >
> > > > The advantage is that it does not require the large data to be inside a
> > > > mbuf (requiring a mbuf structure before the buffer, and requiring it to be
> > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > to implement compared to your solution.
> > >
> > > I knew that you presented the slides and, frankly, I had considered that option
> > > at first. But even with that option, metadata to store the refcnt would also have
> > > to be allocated and managed anyway. The kernel also maintains skb_shared_info at
> > > the end of the data segment. Even though it could have a smaller metadata
> > > structure, I just wanted to make full use of the existing framework because it is
> > > less complex, as you mentioned. Given that you presented the idea of an external
> > > data buffer in 2016 and there haven't been many follow-up discussions/activities
> > > so far, I thought the demand isn't that big yet, so I wanted to make this patch
> > > simpler. I personally think that we can take up the idea of an external data
> > > segment when more demand comes from users in the future, as it would be a huge
> > > change and may break the current ABI/API. When the day comes, I'll gladly
> > > participate in the discussions and write code for it if I can be helpful.
> > >
> > > Do you think this patch is okay for now?
> > >
> > >
> > > Thanks for your comments,
> > > Yongseok
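As a side note on the external data buffer option quoted above: the way I picture it,
each external buffer would need its own small shared-info block kept next to the data,
roughly like the sketch below. None of these names exist in rte_mbuf.h today; this is
only to illustrate the idea from Olivier's slides.

    /*
     * Sketch only: hypothetical shared info for an external data buffer.
     * The type and field names are made up for illustration.
     */
    #include <rte_atomic.h>

    struct ext_buf_shinfo {
            void (*free_cb)(void *buf, void *opaque); /* called when the last reference goes away */
            void *opaque;                             /* user argument passed to free_cb */
            rte_atomic16_t refcnt;                    /* one count per mbuf pointing at the buffer */
    };

    /* Attaching an mbuf to the external buffer would bump refcnt; freeing an
     * mbuf would decrement it and call free_cb(buf, opaque) once it hits zero. */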