From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by dpdk.org (Postfix) with ESMTP id 0768F5A35 for ; Thu, 21 May 2015 14:12:38 +0200 (CEST) Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga103.jf.intel.com with ESMTP; 21 May 2015 05:12:21 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.13,468,1427785200"; d="scan'208";a="698290003" Received: from irsmsx108.ger.corp.intel.com ([163.33.3.3]) by orsmga001.jf.intel.com with ESMTP; 21 May 2015 05:12:20 -0700 Received: from irsmsx103.ger.corp.intel.com ([169.254.3.215]) by IRSMSX108.ger.corp.intel.com ([169.254.11.59]) with mapi id 14.03.0224.002; Thu, 21 May 2015 13:12:19 +0100 From: "Richardson, Bruce" To: Neil Horman , Marc Sune Thread-Topic: [dpdk-dev] [RFC PATCHv2 0/2] pktdev as wrapper type Thread-Index: AQHQjAe5LdE+JdghVUeyOc5gCLzgSJ2DNo4AgAFPUwCAABopAIAABoAAgABtvQCAAB2uAIABEAHQ Date: Thu, 21 May 2015 12:12:18 +0000 Message-ID: <59AF69C657FD0841A61C55336867B5B03453A017@IRSMSX103.ger.corp.intel.com> References: <1431361781-12621-1-git-send-email-bruce.richardson@intel.com> <20150519113112.GA10700@bricha3-MOBL3> <2340603.EB75GY9ZBE@xps13> <555C5C4C.80308@bisdn.de> <20150520102816.GA1775@hmsreliant.think-freely.org> <555CBDCE.5010004@bisdn.de> <20150520184716.GB1775@hmsreliant.think-freely.org> In-Reply-To: <20150520184716.GB1775@hmsreliant.think-freely.org> Accept-Language: en-GB, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [163.33.239.180] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Cc: "dev@dpdk.org" Subject: Re: [dpdk-dev] [RFC PATCHv2 0/2] pktdev as wrapper type X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 21 May 2015 12:12:40 -0000 > -----Original Message----- > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman > Sent: Wednesday, May 20, 2015 7:47 PM > To: Marc Sune > Cc: dev@dpdk.org > Subject: Re: [dpdk-dev] [RFC PATCHv2 0/2] pktdev as wrapper type >=20 > On Wed, May 20, 2015 at 07:01:02PM +0200, Marc Sune wrote: > > > > > > On 20/05/15 12:28, Neil Horman wrote: > > >On Wed, May 20, 2015 at 12:05:00PM +0200, Marc Sune wrote: > > >> > > >>On 20/05/15 10:31, Thomas Monjalon wrote: > > >>>2015-05-19 12:31, Bruce Richardson: > > >>>>On Mon, May 11, 2015 at 05:29:39PM +0100, Bruce Richardson wrote: > > >>>>>Hi all, > > >>>>> > > >>>>>after a small amount of offline discussion with Marc Sune, here > > >>>>>is an alternative proposal for a higher-level interface - aka > > >>>>>pktdev - to allow a common Rx/Tx API across device types handling > > >>>>>mbufs [for now, ethdev, ring and KNI]. The key code is in the > > >>>>>first patch fo the set - the second is an example of a trivial > usecase. > > >>>>> > > >>>>>What is different about this to previously: > > >>>>>* wrapper class, so no changes to any existing ring, ethdev > > >>>>>implementations > > >>>>>* use of function pointers for RX/TX with an API that maps to > ethdev > > >>>>> - this means there is little/no additional overhead for ethdev > calls > > >>>>> - inline special case for rings, to accelerate that. Since we > are at a > > >>>>> higher level, we can special case process some things if > appropriate. This > > >>>>> means the impact to ring ops is one (predictable) branch per > > >>>>>burst > > >>>>>* elimination of the queue abstraction. For the ring and KNI, ther= e > is no > > >>>>> concept of queues, so we just wrap the functions directly (no > need even for > > >>>>> wrapper functions, the api's match so we can call directly). > This also > > >>>>> means: > > >>>>> - adding in features per-queue, is far easier as we don't need > to worry about > > >>>>> having arrays of multiple queues. For example: > > >>>>> - adding in buffering on TX (or RX) is easier since again we > only have a > > >>>>> single queue. > > >>>>>* thread safety is made easier using a wrapper. For a MP ring, we > can create > > >>>>> multiple pktdevs around it, and each thread will then be able t= o > use their > > >>>>> own copy, with their own buffering etc. > > >>>>> > > >>>>>However, at this point, I'm just looking for general feedback on > > >>>>>this as an approach. I think it's quite flexible - even more so > > >>>>>than the earlier proposal we had. It's less proscriptive and > doesn't make any demands on any other libs. > > >>>>> > > >>>>>Comments/thoughts welcome. > > >>>>Any comments on this RFC before I see about investing further time > > >>>>in it to clean it up a bit and submit as a non-RFC patchset for > merge in 2.1? > > >>>I would say there are 2 possible approaches for KNI and ring > handling: > > >>>1/ You Bruce, Marc and Keith are advocating for a layer on top of > > >>>ethdev, ring, KNI and possibly other devices, which uses mbuf. The > > >>>set of functions is simpler than ethdev but the data structure is > > >>>mbuf which is related to ethdev layer. > > >>>2/ Konstantin and Neil talked about keeping mbuf for ethdev layer > > >>>and related libs only. Ring and KNI could have an ethdev API with a > > >>>reduced set of implemented functions. Crypto devices could adopt a > > >>>specific crypto API and an ethdev API at the same time. > > >>I don't fully understand which APIs you meant by non-ethdev. This > > >>pktdev wrapper proposal abstracts RX and TX functions only, and all > > >>of these are using mbufs as the packet buffer abstraction right now > anyway (ethdev). > > >> > > >He's referring to future device classes (like crypto devices), which > > >ostensibly would make use of the pktdev API. My argument (and I > > >think Thomas') is that if a bit of hardware can be made to operate as > > >a packet sending/receiving device, then its just as reasonable to use > > >the existing ethdev api rather than some other restricted version of > > >it (pktdev) > > > > > >>This approach does not preclude that different libraries expose > > >>other API calls. In fact they will have to; setup the port/device > > >>... It is just a higher level API, so that you don't have to check > > >>the type of port in your DPDK application I/O loop, minimizing user's > code. > > >> > > >No argument there. But if thats the case (and I agree that it is), > > >an application will implicitly have to know what what type of device > > >it is, because it (the application) will need to understand the > specific API it is writing to. > > > > > >>Or were you in 2) thinking about creating a different "packet buffer" > > >>abstraction, independent from the ethdev, and then map the different > > >>port specifics (e.g. mbuf) to this new abstraction? > > >> > > >My argument was to just leave the ethdev api alone. If a device > > >class can be made to look like a packet forwarding device, then use > > >the existing ethdev api to implement it. > > > > > >>>I feel it's cleaner, more generic and more maintainable to have > > >>>drivers implementing one or several stable APIs instead of having > > >>>some restricted wrappers to update. > > >>This would be a separate library _on top_ of the existing APIs, and > > >>it has the advantage to simplify the DPDK user's application code > > >>when an application needs to deal with several types of port, as > > >>shown in the example that Bruce provided in PATCH #2. > > >> > > >But thats already the purpose of the ethdev api. Different types of > > >hardware/software can be made to look like the same thing (an ethdev) > > >from an application standpoint. Adding this pktdev layer does > > >nothing but that, add a layer. If you want restricted functionality > > >of an interface, thats ok, ethdev offers that ability. unimplemented > > >methods in a pmd cause the ethdev api to return EOPNOTSUP to the > > >calling application, so the application knows when a given ethdev can'= t > do some aspect of what an ethdev is. > > > > Hi Neil, > > > > Thanks for the clarifications. Now I understand the concern Thomas > > expressed. Using ethdev API (port-ids) was actually my first > > suggestion > > here: > > > > http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/13545 > > > > And to be honest, what I was expecting when I was reading for the > > first time DPDK's APIs. It is indeed an option. However, if we take a > look at the API: > > > > http://www.dpdk.org/doc/api/rte__ethdev_8h.html > > > > none of the API calls, except the burst RX/TX and, perhaps, the > > callbacks, would be used by devices other than NICs. It seems going a > > bit too far using it, but ofc possible. > > > So, I'll make 3 counter-arguments here: >=20 > 1) To your point about the ethdev api being much larger than what a non- > ethernet device could use, I'll tacitly agree, but indicate that its not > relevant. If you want a bit of hardware that isn't a network interface t= o > behave like a network interface, then there are going to be alot of > aspects of a network interface that it just can't do. Thats true > regardless of how you implement that. In the pktdev model, you prevent > those operations from being an option at all, while in the current ethdev > model, you simply get a return code of EOPNOTSUP, and the application doe= s > the right thing (which is to say, it understands that this hardware > doesn't need that aspect of network card mangement and goes on with its > day). I assert that, because we already have the ethdev api, its a lower > time investment to simply reuse it >=20 > 2) To the implication that we aren't working with NICs here, you're > correct. As you note in your previous message, the pktdev interface is i= n > no way the end all and be all of device model design. You will need to > add other api calls to manage the device. If thats the case, then don't > shoehorn any one particular aspect of the API to fit a device model that > the device doesn't conform to. > Design the API so that it best reflects the hardware behavior. >=20 >=20 > 3) An addendum to the point about hardware not being a NIC (and you didn'= t > make this point directly above, but I think you may have mentioned it > previously), sometimes you want a device to behave like another device fo= r > the purposes of using generic code to talk to several device types. Whil= e > this is true, this is a case for device translation and use, not for > carving out parts of an api to make something more generic. The use case > I cited previously was an ipsec tunnel. An ipsec tunnel uses > cryptography, and crypto device apis to encrypt decrypt packet data. The > common way to implement this is to design a crypto api that accepts a > block of data in a way most condusive to the hardware, and then implement > a network driver (that uses whatever ethernet api, in this case the ethde= v > api), to integrate with the network datapath. With this model, the ipsec > tunnel uses the full range of the ethdev api (or a good deal more of it), > and the crypto api is optimized to work with crypo acceleration hardware. >=20 > > In essence, rte_ether(rte_ethdev.h) right now has: i) NIC setup; > > general configuration, queue config, fdir, offloads, hw stuff like > > leds... ii) RX/TX routines and callbacks iii) Stats and queue stats > > iv) other utils for ethernet stuff (rte_ether.h) > > > The key that I'm taking away here is 'right now'. Its already written, s= o > theres no work involved in implementing it for new devices. >=20 > > i) is clearly HW specific, and does only apply to NICs/ASICs (e.g. > > FM10k) > Ok, so it only applies to NIC's, thats fine. If you want to write a > driver that leaves those methods for the pmd set to NULL, the ethdev > library will correctly return EOPNOTSUPP to the calling applications. >=20 > > while ii) and iii) are things that could be abstracted beyond NICs, > > like KNI, rte_ring, crypto... (iv could be moved into some > > utils/protocol parsing libraries). > > > Right again, so let those device types implement the appropriate portions > of the pmd driver structure that match to what they support. EVerything > else is handled by the ethdev library automatically. >=20 > > Perhaps these two groups could be split into two different libraries > > and then ii) and iii) together would be something like ~ rte_pktdev > > (stats are missing on the proposed patch), while i) would be > > rte_ether, or rte_nic if we think it is a better name. > > > The point I'm trying to get to is, why split at all? Theres just no need > that I can see. The example I would set here is the dummy driver in linux= . > Its a net device that only serves to act as a sink for network packets. > It still uses the network driver interface, but of the 65-ish methods tha= t > the netdevice model in linux offers, it implements 8 (or approximately > 12%). The other unused method are just that, unused, and thats ok. > Applications that try to do things like set flow director options, or > speed/duplex options gets a return code that effectively says "This devic= e > can't do that", and thats ok. Thats what we need to be doing here. > Instead of finding a way to codify the subset of functionality that other > devices might be able to implement, for those cases where we want other > hardware to act like a netdevice, lets just let those devices pick and > choose what to implement, and the interface we already have will > communicate with applications appropriately. >=20 > Regards > Neil >=20 Hi Neil, First off, a note on the naming and the basic concept: this proposal is not= trying to make everything look like NIC, rather we are trying to make a bu= nch of different components appear as generic sources/sinks for pkts or mbu= fs. From my point of view, it's an important difference. Be that as it may, I'd like to first deal with the whole idea of the applic= ation needing to know about the type of the underlying device. For me, this= is a critical point. Applications - such as all our sample apps - have ess= entially two parts: * an initialization and control part * a data-path part. These two parts are very, very different in what they do. The initializatio= n part - which e.g. in testpmd continues on in the form of the cmdline inte= rface as a control part - does the initial setup of devices/rings/etc. and = potentially makes use of the full APIs provided by the ethdev interface. It= 's also not performance critical, as evidenced by the fact that the APIs us= ed there have additional checks for valid input etc. The second, data-path = part, is entirely the part that this proposal is targeting. This data-path = is completely separate in the application, is highly performance sensitive,= and rarely, if ever needs to know or care about the actual source of its d= ata. So the idea behind this library is that you can write your initializat= ion control parts of your app as-now, fully aware of the underlying types i= nvolved, and without ever using rx/tx burst. Then when you have the various= devices and DPDK objects set up, you spawn your data-path threads and pass= each one the set of input and outputs it needs, in the form of your generi= c packet source objects. This distinction is also why I'm not particularly interested in the ability= to pass in different objects via cmdline, as is done now with pcap/ring PM= Ds. That's ok when you want the initialization part of your app to be obliv= ious to see everything under a common abstraction, but when it's only the d= ata path you want to work with generic packet objects that's unnecessary, a= nd the initialization path should be able to convert any of the required in= put/output sources to a generic type using a single API call. [This doesn't= rule out specifying different inputs/outputs on the commandline, it's just= you can specify them as their native types, rather than hiding them under = a common API at the control-path level]. As for what that abstraction should be. There are a number of issues I see = with ethdev - as it is right now, as that common abstraction. 1. The use of port-ids. I think port ids are fine for numbering physical po= rts, but I think pointers are better for passing around objects to be worke= d on by the data path. What is more concerning [than my opinion on numbers = vs pointers :-)] is the fact that we are limited to 256 port ids. Yes, that= can be changed, but the impacts are massive. To change the type, we would = break the ABI for every single ethdev API, as well as likely other function= s too. Furthermore, increasing the size of the port id would require a chan= ge to the internals of the mbuf structure, which would lead to the ABI bein= g broken for any function that uses mbufs. By adopting an API, such as prop= osed, which uses pointers, we avoid the problem, as port ids would only app= ly to ethdevs. 2. Simplicity. While you say that its fine for an ethdev not to implement a= ll the functions in the ethdev API, to create a proper PMD like you are pro= posing involves a good deal more work than using the proposed pktdev abstra= ction. If it's to appear like a proper NIC to the control paths, as well as= the init paths - which seems to be what you imply - you really do need to = implement additional functions like queue setup, and start and stop. While = it's true that the library can return -ENOSUP on an unsupported function, I= don't believe any of our sample apps are set up to check for this on NIC s= etup, and therefore I would hazard a guess that real-world customer apps ar= en't set up to handle it either. 3. Performance for rings. While not applicable for all cases, the performan= ce of the rings under an ethdev abstraction would not be the same as here. = For example, when polling on an empty ring for packets, the current time ta= ken by our ring rx/tx functions, is literally a few cycles (as tested by th= e rings autotest). If these functions cannot be inlined, that cycle count g= oes up to 3x what it is now. [I observed this previously when doing reworki= ng of the rings code, and the code-size led to icc no longer doing inlining= . In that case, the gcc code for empty polling was indeed 3 times faster th= an the icc version. Adding forced inlining made things equal again]. This m= etric of empty polling may seem trivial i.e. "if there are no packets, why = does it matter how long it takes?", but is important in real-world cases wh= ere you are pulling packets from multiple sources, and your application is = only currently dealing with input on one of them. [Often tested to see how = an application handles in a single-flow situation - an metric our customers= do look at]. Even in the non-empty situation, for smaller packet bursts, t= he overhead of the function call may slow things down. [For larger bursts, = e.g. 32, the effect should not be noticeable, I suspect]. The only other final point I'd make here is that what is proposed is not pr= oscriptive - whatever a future API for handling other device types, such as= crypto devices, may look like can be decided separately from this pktdev i= mplementation. Whether one chooses pktdev or ethdev as a common abstraction= layer type, the decision of whether or not a particular object type is all= owed to be made look like that common type can be made entirely independent= ly, and based upon whether or not such a type-conversion makes sense. Regards, /Bruce