From: "Ananyev, Konstantin"
To: Yongseok Koh
CC: Olivier Matz, "Lu, Wenzhuo", "Wu, Jingjing", Adrien Mazarguil,
 Nélio Laranjeiro, dev@dpdk.org
Date: Thu, 12 Apr 2018 16:34:56 +0000
Message-ID: <2601191342CEEE43887BDE71AB977258AE914692@IRSMSX102.ger.corp.intel.com>
In-Reply-To: <20180411170810.GA27791@yongseok-MBP.local>
Subject: Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for
 flexible indirection

> > > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > > Hi Yongseok,
> > > > > >
> > > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > > When attaching a mbuf, an indirect mbuf has to point to the start of
> > > > > > > > > the buffer of the direct mbuf. By adding a buf_off field to rte_mbuf,
> > > > > > > > > this becomes more flexible. An indirect mbuf can point to any part of
> > > > > > > > > the direct mbuf by calling rte_pktmbuf_attach_at().
> > > > > > > > >
> > > > > > > > > Possible use-cases could be:
> > > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Yongseok Koh
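
(For illustration, a minimal sketch of the proposed usage; the prototype of
rte_pktmbuf_attach_at() is assumed here to mirror rte_pktmbuf_attach() plus a
byte offset, and <rte_mbuf.h> is assumed included - the actual patch may
differ:)

    /* m: a large direct mbuf, e.g. holding several received packets */
    static void
    attach_at_example(struct rte_mbuf *m, struct rte_mempool *pool)
    {
        struct rte_mbuf *c1 = rte_pktmbuf_alloc(pool);
        struct rte_mbuf *c2 = rte_pktmbuf_alloc(pool);

        if (c1 == NULL || c2 == NULL)
            return;
        /* two independent views into the same direct buffer */
        rte_pktmbuf_attach_at(c1, m, 0);    /* from the buffer start */
        rte_pktmbuf_attach_at(c2, m, 2048); /* from a deeper offset */
    }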
> > > > > > > >
> > > > > > > > I think the current API is already able to do what you want.
> > > > > > > >
> > > > > > > > 1/ Here is a mbuf m with its data
> > > > > > > >
> > > > > > > >   off
> > > > > > > >   <-->
> > > > > > > >          len
> > > > > > > >   +----+ <---------->
> > > > > > > >   |    |
> > > > > > > > +-|----v----------------------+
> > > > > > > > | |    -----------------------|
> > > > > > > > m | buf |  XXXXXXXXXXX       ||
> > > > > > > > |      -----------------------|
> > > > > > > > +-----------------------------+
> > > > > > > >
> > > > > > > > 2/ clone m:
> > > > > > > >
> > > > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > > > >   rte_pktmbuf_attach(c, m);
> > > > > > > >
> > > > > > > > Note that c has its own offset and length fields.
> > > > > > > >
> > > > > > > >   off
> > > > > > > >   <-->
> > > > > > > >          len
> > > > > > > >   +----+ <---------->
> > > > > > > >   |    |
> > > > > > > > +-|----v----------------------+
> > > > > > > > | |    -----------------------|
> > > > > > > > m | buf |  XXXXXXXXXXX       ||
> > > > > > > > |      -----------------------|
> > > > > > > > +------^----------------------+
> > > > > > > >        |
> > > > > > > >   +----+
> > > > > > > >   | indirect
> > > > > > > > +-|---------------------------+
> > > > > > > > | |    -----------------------|
> > > > > > > > c | buf |                    ||
> > > > > > > > |      -----------------------|
> > > > > > > > +-----------------------------+
> > > > > > > >
> > > > > > > > off       len
> > > > > > > > <--><---------->
> > > > > > > >
> > > > > > > > 3/ remove some data from c without changing m
> > > > > > > >
> > > > > > > >   rte_pktmbuf_adj(c, 10)   // at head
> > > > > > > >   rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > > >
> > > > > > > > Please let me know if it fits your needs.
> > > > > > >
> > > > > > > No, it doesn't.
> > > > > > >
> > > > > > > Trimming head and tail with the current APIs removes data and makes the
> > > > > > > space available. Adjusting the packet head means giving more headroom,
> > > > > > > not shifting the buffer itself. If m has two indirect mbufs (c1 and c2)
> > > > > > > and those are pointing to different offsets in m,
> > > > > > >
> > > > > > >   rte_pktmbuf_adj(c1, 10);
> > > > > > >   rte_pktmbuf_adj(c2, 20);
> > > > > > >
> > > > > > > then the owner of c2 regards the first (off+20)B as available headroom.
> > > > > > > If it wants to attach an outer header, it will overwrite the headroom
> > > > > > > even though the owner of c1 is still accessing it. Instead, another
> > > > > > > mbuf (h1) for the outer header should be linked by h1->next = c2.
> > > > > >
> > > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > > prepend headers, another mbuf has to be inserted before, as you suggest.
> > > > > > It is possible to wrap this in a function
> > > > > > rte_pktmbuf_clone_area(m, offset, length) that will:
> > > > > >   - alloc and attach an indirect mbuf for each segment of m that is
> > > > > >     in the range [offset : length+offset].
> > > > > >   - prepend an empty and writable mbuf for the headers
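
(For illustration, a sketch of what such a helper could look like, assuming a
single-segment m and a valid offset/length; the helper name, the extra mempool
argument and the error handling are assumptions, not part of the proposal:)

    static struct rte_mbuf *
    pktmbuf_clone_area(struct rte_mbuf *m, uint16_t offset,
                       uint16_t length, struct rte_mempool *mp)
    {
        /* writable mbuf that will carry the prepended headers */
        struct rte_mbuf *hdr = rte_pktmbuf_alloc(mp);
        /* read-only indirect view of [offset : offset+length] in m */
        struct rte_mbuf *c = rte_pktmbuf_alloc(mp);

        if (hdr == NULL || c == NULL) {
            rte_pktmbuf_free(hdr);
            rte_pktmbuf_free(c);
            return NULL;
        }
        rte_pktmbuf_attach(c, m);
        rte_pktmbuf_adj(c, offset);  /* cut everything before the area */
        rte_pktmbuf_trim(c, rte_pktmbuf_data_len(c) - length);  /* and after */

        hdr->next = c;          /* empty writable headers first, ... */
        hdr->pkt_len = length;  /* ... then the indirect view */
        hdr->nb_segs = 2;
        return hdr;
    }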
> > > > > >
> > > > > > > If c1 and c2 are attached with shifting the buffer address by adjusting
> > > > > > > buf_off, which actually shrinks the headroom, this case can be properly
> > > > > > > handled.
> > > > > >
> > > > > > What do you mean by properly handled?
> > > > > >
> > > > > > Yes, prepending data or adding data in the indirect mbuf won't overwrite
> > > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > > won't be protected.
> > > > > >
> > > > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > > > have refcnt != 1, should both be considered as read-only because they may
> > > > > > share their data. How can an application know if the data is shared or
> > > > > > not?
> > > > > >
> > > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > > >
> > > > > Agree that an indirect mbuf must be treated as read-only. Then the current
> > > > > code is enough to handle that use-case.
> > > > >
> > > > > > > And another use-case (this is my actual use-case) is to make a large
> > > > > > > mbuf have multiple packets in series. AFAIK, this will also be helpful
> > > > > > > for some FPGA NICs because they transfer multiple packets to a single
> > > > > > > large buffer to reduce PCIe overhead for small-packet traffic, like
> > > > > > > the Multi-Packet Rx of mlx5 does. Otherwise, packets have to be
> > > > > > > memcpy'd to regular mbufs one by one instead of using indirect
> > > > > > > referencing.
> > > >
> > > > But just to make HW RX multiple packets into one mbuf, data_off inside the
> > > > indirect mbuf should be enough, correct?
> > >
> > > Right. The current max buffer len of a mbuf is 64kB (16 bits) but it is
> > > enough for mlx5 to reach 100Gbps with 64B traffic (149Mpps). I made mlx5 HW
> > > put 16 packets in a buffer, so it needs a ~32kB buffer. Having more bits in
> > > the length fields would be better, but 16-bit is good enough to overcome the
> > > PCIe Gen3 bottleneck in order to saturate the network link.
> >
> > There were a few complaints that the 64KB max is a limitation for some
> > use-cases. I am not against increasing it, but I don't think we have free
> > space on the first cache-line for that without another big rework of the
> > mbuf layout, considering that we would need to increase the size of buf_len,
> > data_off, data_len, and probably priv_size too.
> >
> > > > As I understand, what you'd like to achieve with this new field is the
> > > > ability to manipulate packet boundaries after RX, probably at an upper
> > > > layer. As Olivier pointed out above, that doesn't sound like a safe
> > > > approach - you have multiple indirect mbufs trying to modify the same
> > > > direct buffer.
> > >
> > > I agree that there's an implication that an indirect mbuf, or a mbuf having
> > > refcnt > 1, is read-only. What that means is that all the entities which
> > > own such mbufs have to be aware of it and keep to the principle, as DPDK
> > > can't enforce the rule and there can't be such a sanity check. In this
> > > sense, HW doesn't violate it because the direct mbuf is injected to HW
> > > before indirection. When packets are written by HW, the PMD attaches
> > > indirect mbufs to the direct mbuf and delivers those to the application
> > > layer, freeing the original direct mbuf (decrementing its refcnt by 1).
> > > So, HW doesn't touch the direct buffer once it reaches the upper layer.
> >
> > Yes, I understand that. But as I can see, you introduced functions to adjust
> > head and tail, which implies that it should be possible for some entity
> > (upper layer?) to manipulate these indirect mbufs. And we don't know how
> > exactly it will be done.
>
> That's a valid concern. I can make it private by merging it into the
> _attach_to() func, or I can just add a comment in the API doc. However, if
> users are aware that a mbuf is read-only and we expect them to keep it intact
> by their own judgement, they would/should not use those APIs. We can't stop
> them modifying the content or the buffer itself anyway. Will add more
> comments on this discussion regarding read-only mode.

Ok, so these functions are intended to be used only at PMD level?
But in that case, do you need them at all?
Isn't it possible to implement the same thing with just data_off?
I mean, your PMD knows in advance the buf_len of the mbuf, and at startup time
it can decide how it is going to slice it into multiple packets.
So each offset is known in advance and you don't need to worry that you'll
overwrite a neighbor packet's data.
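
(For illustration, a rough sketch of that data_off-only slicing scheme; the
function name, the pkts/nb_slices/slice_len parameters and the omitted error
handling are assumptions:)

    static void
    slice_big_mbuf(struct rte_mbuf *big, struct rte_mempool *mp,
                   struct rte_mbuf **pkts, uint16_t nb_slices,
                   uint16_t slice_len)
    {
        uint16_t off = big->data_off;
        uint16_t i;

        for (i = 0; i < nb_slices; i++) {
            pkts[i] = rte_pktmbuf_alloc(mp);
            rte_pktmbuf_attach(pkts[i], big); /* bumps refcnt of big */
            pkts[i]->data_off = off;          /* point at the i-th slice */
            pkts[i]->data_len = slice_len;
            pkts[i]->pkt_len = slice_len;
            off += slice_len;
        }
        /* drop the PMD's own reference; big is freed for real only
         * once every indirect slice has been freed */
        rte_pktmbuf_free(big);
    }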
>
> > > The direct buffer will be freed and become available for reuse when all
> > > the attached indirect mbufs are freed.
> > >
> > > > Though if you really need to do that, why can't it be achieved by
> > > > updating the buf_len and priv_size fields for indirect mbufs, straight
> > > > after attach()?
> > >
> > > Good point.
> > > Actually that was my draft (Mellanox internal) version of this patch :-)
> > > But I had to consider a case where priv_size is really given by the user.
> > > Even though it is less likely, if the original priv_size is quite big, it
> > > can't cover the entire buf_len. For this, I would have had to increase
> > > priv_size to 32-bit, but adding another 16-bit field (buf_off) looked
> > > more plausible.
> >
> > As I remember, we can't have mbufs bigger than 64K, so priv_size + buf_len
> > should always be less than 64K, correct?
>
> Can you let me know where I can find that constraint? I checked
> rte_pktmbuf_pool_create() and rte_pktmbuf_init() again so as not to make any
> mistake, but there's no such limitation.
>
>     elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size +
>                (unsigned)data_room_size;

Ok, I scanned through librte_mbuf and didn't find any limitations.
Seems like a false impression from my side.
Anyway, having priv_size + buf_len > 64KB seems like a corner case.
Do you really need to support it?
Konstantin

> The max of data_room_size is 64kB, and so is priv_size. m->buf_addr starts
> from 'm + sizeof(*m) + priv_size' and m->buf_len can't be larger than
> UINT16_MAX. So, priv_size couldn't be used for this purpose.
>
> Yongseok
>
> > > > > > > Does this make sense?
> > > > > >
> > > > > > I understand the need.
> > > > > >
> > > > > > Another option would be to make mbuf->buffer point to an external
> > > > > > buffer (not inside the direct mbuf). This would require adding a
> > > > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1]
> > > > > > for a quick overview.
> > > > > >
> > > > > > [1] https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf
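
(For illustration, one hypothetical shape of that external-buffer idea; none
of these names existed in DPDK at the time of this discussion, and the slides
only call for some refcnt'ed metadata plus a free callback:)

    #include <rte_atomic.h>

    /* metadata kept alongside the external buffer, similar in spirit
     * to the kernel's skb_shared_info */
    struct ext_buf_info {
        void (*free_cb)(void *addr, void *opaque); /* the mbuf->free_cb idea */
        void *opaque;           /* user argument handed back to free_cb */
        rte_atomic16_t refcnt;  /* shared by all mbufs using the buffer */
    };

An attaching mbuf would then point m->buf_addr at the external buffer and
invoke free_cb once the last reference is gone.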
> > > > > >
> > > > > > The advantage is that it does not require the large data to be inside
> > > > > > a mbuf (requiring a mbuf structure before the buffer, and requiring it
> > > > > > to be allocated from a mempool). On the other hand, it is maybe more
> > > > > > complex to implement compared to your solution.
> > > > >
> > > > > I knew that you had presented those slides and, frankly, I considered
> > > > > that option at first. But even with that option, metadata to store the
> > > > > refcnt would also have to be allocated and managed anyway. The kernel
> > > > > likewise maintains the skb_shared_info at the end of the data segment.
> > > > > Even though it could have a smaller metadata structure, I just wanted to
> > > > > make full use of the existing framework because it is less complex, as
> > > > > you mentioned. Given that you presented the idea of an external data
> > > > > buffer in 2016 and there haven't been many follow-up
> > > > > discussions/activities so far, I thought the demand isn't that big yet,
> > > > > so I wanted to keep this patch simpler. I personally think that we can
> > > > > take up the idea of external data segments when more demand comes from
> > > > > users in the future, as it would be a huge change and may break the
> > > > > current ABI/API. When the day comes, I'll gladly participate in the
> > > > > discussions and write code for it if I can be helpful.
> > > > >
> > > > > Do you think this patch is okay for now?
> > > > >
> > > > >
> > > > > Thanks for your comments,
> > > > > Yongseok