From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <bruce.richardson@intel.com>
Received: from mga03.intel.com (mga03.intel.com [134.134.136.65])
 by dpdk.org (Postfix) with ESMTP id 0768F5A35
 for <dev@dpdk.org>; Thu, 21 May 2015 14:12:38 +0200 (CEST)
Received: from orsmga001.jf.intel.com ([10.7.209.18])
 by orsmga103.jf.intel.com with ESMTP; 21 May 2015 05:12:21 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.13,468,1427785200"; d="scan'208";a="698290003"
Received: from irsmsx108.ger.corp.intel.com ([163.33.3.3])
 by orsmga001.jf.intel.com with ESMTP; 21 May 2015 05:12:20 -0700
Received: from irsmsx103.ger.corp.intel.com ([169.254.3.215]) by
 IRSMSX108.ger.corp.intel.com ([169.254.11.59]) with mapi id 14.03.0224.002;
 Thu, 21 May 2015 13:12:19 +0100
From: "Richardson, Bruce" <bruce.richardson@intel.com>
To: Neil Horman <nhorman@tuxdriver.com>, Marc Sune <marc.sune@bisdn.de>
Thread-Topic: [dpdk-dev] [RFC PATCHv2 0/2] pktdev as wrapper type
Thread-Index: AQHQjAe5LdE+JdghVUeyOc5gCLzgSJ2DNo4AgAFPUwCAABopAIAABoAAgABtvQCAAB2uAIABEAHQ
Date: Thu, 21 May 2015 12:12:18 +0000
Message-ID: <59AF69C657FD0841A61C55336867B5B03453A017@IRSMSX103.ger.corp.intel.com>
References: <1431361781-12621-1-git-send-email-bruce.richardson@intel.com>
 <20150519113112.GA10700@bricha3-MOBL3> <2340603.EB75GY9ZBE@xps13>
 <555C5C4C.80308@bisdn.de>
 <20150520102816.GA1775@hmsreliant.think-freely.org>
 <555CBDCE.5010004@bisdn.de>
 <20150520184716.GB1775@hmsreliant.think-freely.org>
In-Reply-To: <20150520184716.GB1775@hmsreliant.think-freely.org>
Accept-Language: en-GB, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [163.33.239.180]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [RFC PATCHv2 0/2] pktdev as wrapper type
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Thu, 21 May 2015 12:12:40 -0000


> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> Sent: Wednesday, May 20, 2015 7:47 PM
> To: Marc Sune
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [RFC PATCHv2 0/2] pktdev as wrapper type
>=20
> On Wed, May 20, 2015 at 07:01:02PM +0200, Marc Sune wrote:
> >
> >
> > On 20/05/15 12:28, Neil Horman wrote:
> > >On Wed, May 20, 2015 at 12:05:00PM +0200, Marc Sune wrote:
> > >>
> > >>On 20/05/15 10:31, Thomas Monjalon wrote:
> > >>>2015-05-19 12:31, Bruce Richardson:
> > >>>>On Mon, May 11, 2015 at 05:29:39PM +0100, Bruce Richardson wrote:
> > >>>>>Hi all,
> > >>>>>
> > >>>>>after a small amount of offline discussion with Marc Sune, here
> > >>>>>is an alternative proposal for a higher-level interface - aka
> > >>>>>pktdev - to allow a common Rx/Tx API across device types handling
> > >>>>>mbufs [for now, ethdev, ring and KNI]. The key code is in the
> > >>>>>first patch fo the set - the second is an example of a trivial
> usecase.
> > >>>>>
> > >>>>>What is different about this to previously:
> > >>>>>* wrapper class, so no changes to any existing ring, ethdev
> > >>>>>implementations
> > >>>>>* use of function pointers for RX/TX with an API that maps to
> ethdev
> > >>>>>   - this means there is little/no additional overhead for ethdev
> calls
> > >>>>>   - inline special case for rings, to accelerate that. Since we
> are at a
> > >>>>>     higher level, we can special case process some things if
> appropriate. This
> > >>>>>     means the impact to ring ops is one (predictable) branch per
> > >>>>>burst
> > >>>>>* elimination of the queue abstraction. For the ring and KNI, ther=
e
> is no
> > >>>>>   concept of queues, so we just wrap the functions directly (no
> need even for
> > >>>>>   wrapper functions, the api's match so we can call directly).
> This also
> > >>>>>   means:
> > >>>>>   - adding in features per-queue, is far easier as we don't need
> to worry about
> > >>>>>     having arrays of multiple queues. For example:
> > >>>>>   - adding in buffering on TX (or RX) is easier since again we
> only have a
> > >>>>>     single queue.
> > >>>>>* thread safety is made easier using a wrapper. For a MP ring, we
> can create
> > >>>>>   multiple pktdevs around it, and each thread will then be able t=
o
> use their
> > >>>>>   own copy, with their own buffering etc.
> > >>>>>
> > >>>>>However, at this point, I'm just looking for general feedback on
> > >>>>>this as an approach. I think it's quite flexible - even more so
> > >>>>>than the earlier proposal we had. It's less proscriptive and
> doesn't make any demands on any other libs.
> > >>>>>
> > >>>>>Comments/thoughts welcome.
> > >>>>Any comments on this RFC before I see about investing further time
> > >>>>in it to clean it up a bit and submit as a non-RFC patchset for
> merge in 2.1?
> > >>>I would say there are 2 possible approaches for KNI and ring
> handling:
> > >>>1/ You Bruce, Marc and Keith are advocating for a layer on top of
> > >>>ethdev, ring, KNI and possibly other devices, which uses mbuf. The
> > >>>set of functions is simpler than ethdev but the data structure is
> > >>>mbuf which is related to ethdev layer.
> > >>>2/ Konstantin and Neil talked about keeping mbuf for ethdev layer
> > >>>and related libs only. Ring and KNI could have an ethdev API with a
> > >>>reduced set of implemented functions. Crypto devices could adopt a
> > >>>specific crypto API and an ethdev API at the same time.
> > >>I don't fully understand which APIs you meant by non-ethdev. This
> > >>pktdev wrapper proposal abstracts RX and TX functions only, and all
> > >>of these are using mbufs as the packet buffer abstraction right now
> anyway (ethdev).
> > >>
> > >He's referring to future device classes (like crypto devices), which
> > >ostensibly would make use of the pktdev API.  My argument (and I
> > >think Thomas') is that if a bit of hardware can be made to operate as
> > >a packet sending/receiving device, then its just as reasonable to use
> > >the existing ethdev api rather than some other restricted version of
> > >it (pktdev)
> > >
> > >>This approach does not preclude that different libraries expose
> > >>other API calls. In fact they will have to; setup the port/device
> > >>... It is just a higher level API, so that you don't have to check
> > >>the type of port in your DPDK application I/O loop, minimizing user's
> code.
> > >>
> > >No argument there.  But if thats the case (and I agree that it is),
> > >an application will implicitly have to know what what type of device
> > >it is, because it (the application) will need to understand the
> specific API it is writing to.
> > >
> > >>Or were you in 2) thinking about creating a different "packet buffer"
> > >>abstraction, independent from the ethdev, and then map the different
> > >>port specifics (e.g. mbuf) to this new abstraction?
> > >>
> > >My argument was to just leave the ethdev api alone.  If a device
> > >class can be made to look like a packet forwarding device, then use
> > >the existing ethdev api to implement it.
> > >
> > >>>I feel it's cleaner, more generic and more maintainable to have
> > >>>drivers implementing one or several stable APIs instead of having
> > >>>some restricted wrappers to update.
> > >>This would be a separate library _on top_ of the existing APIs, and
> > >>it has the advantage to simplify the DPDK user's application code
> > >>when an application needs to deal with several types of port, as
> > >>shown in the example that Bruce provided in PATCH #2.
> > >>
> > >But thats already the purpose of the ethdev api.  Different types of
> > >hardware/software can be made to look like the same thing (an ethdev)
> > >from an application standpoint.  Adding this pktdev layer does
> > >nothing but that, add a layer.  If you want restricted functionality
> > >of an interface, thats ok, ethdev offers that ability.  unimplemented
> > >methods in a pmd cause the ethdev api to return EOPNOTSUP to the
> > >calling application, so the application knows when a given ethdev can'=
t
> do some aspect of what an ethdev is.
> >
> > Hi Neil,
> >
> > Thanks for the clarifications. Now I understand the concern Thomas
> > expressed. Using ethdev API (port-ids) was actually my first
> > suggestion
> > here:
> >
> > http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/13545
> >
> > And to be honest, what I was expecting when I was reading for the
> > first time DPDK's APIs. It is indeed an option. However, if we take a
> look at the API:
> >
> > http://www.dpdk.org/doc/api/rte__ethdev_8h.html
> >
> > none of the API calls, except the burst RX/TX and, perhaps, the
> > callbacks, would be used by devices other than NICs. It seems going a
> > bit too far using it, but ofc possible.
> >
> So, I'll make 3 counter-arguments here:
>=20
> 1) To your point about the ethdev api being much larger than what a non-
> ethernet device could use, I'll tacitly agree, but indicate that its not
> relevant.  If you want a bit of hardware that isn't a network interface t=
o
> behave like a network interface, then there are going to be alot of
> aspects of a network interface that it just can't do.  Thats true
> regardless of how you implement that.  In the pktdev model, you prevent
> those operations from being an option at all, while in the current ethdev
> model, you simply get a return code of EOPNOTSUP, and the application doe=
s
> the right thing (which is to say, it understands that this hardware
> doesn't need that aspect of network card mangement and goes on with its
> day).  I assert that, because we already have the ethdev api, its a lower
> time investment to simply reuse it
>=20
> 2) To the implication that we aren't working with NICs here, you're
> correct.  As you note in your previous message, the pktdev interface is i=
n
> no way the end all and be all of device model design.  You will need to
> add other api calls to manage the device.  If thats the case, then don't
> shoehorn any one particular aspect of the API to fit a device model that
> the device doesn't conform to.
> Design the API so that it best reflects the hardware behavior.
>=20
>=20
> 3) An addendum to the point about hardware not being a NIC (and you didn'=
t
> make this point directly above, but I think you may have mentioned it
> previously), sometimes you want a device to behave like another device fo=
r
> the purposes of using generic code to talk to several device types.  Whil=
e
> this is true, this is a case for device translation and use, not for
> carving out parts of an api to make something more generic.  The use case
> I cited previously was an ipsec tunnel.  An ipsec tunnel uses
> cryptography, and crypto device apis to encrypt decrypt packet data.  The
> common way to implement this is to design a crypto api that accepts a
> block of data in a way most condusive to the hardware, and then implement
> a network driver (that uses whatever ethernet api, in this case the ethde=
v
> api), to integrate with the network datapath.  With this model, the ipsec
> tunnel uses the full range of the ethdev api (or a good deal more of it),
> and the crypto api is optimized to work with crypo acceleration hardware.
>=20
> > In essence, rte_ether(rte_ethdev.h) right now has: i) NIC setup;
> > general configuration, queue config, fdir, offloads, hw stuff like
> > leds... ii) RX/TX routines and callbacks iii) Stats and queue stats
> > iv) other utils for ethernet stuff (rte_ether.h)
> >
> The key that I'm taking away here is 'right now'.  Its already written, s=
o
> theres no work involved in implementing it for new devices.
>=20
> > i) is clearly HW specific, and does only apply to NICs/ASICs (e.g.
> > FM10k)
> Ok, so it only applies to NIC's, thats fine.  If you want to write a
> driver that leaves those methods for the pmd set to NULL, the ethdev
> library will correctly return EOPNOTSUPP to the calling applications.
>=20
> > while ii) and iii) are things that could be abstracted beyond NICs,
> > like KNI, rte_ring, crypto... (iv could be moved into some
> > utils/protocol parsing libraries).
> >
> Right again, so let those device types implement the appropriate portions
> of the pmd driver structure that match to what they support.  EVerything
> else is handled by the ethdev library automatically.
>=20
> > Perhaps these two groups could be split into two different libraries
> > and then ii) and iii) together would be something like ~ rte_pktdev
> > (stats are missing on the proposed patch), while i) would be
> > rte_ether, or rte_nic if we think it is a better name.
> >
> The point I'm trying to get to is, why split at all?  Theres just no need
> that I can see. The example I would set here is the dummy driver in linux=
.
> Its a net device that only serves to act as a sink for network packets.
> It still uses the network driver interface, but of the 65-ish methods tha=
t
> the netdevice model in linux offers, it implements 8 (or approximately
> 12%).  The other unused method are just that, unused, and thats ok.
> Applications that try to do things like set flow director options, or
> speed/duplex options gets a return code that effectively says "This devic=
e
> can't do that", and thats ok.  Thats what we need to be doing here.
> Instead of finding a way to codify the subset of functionality that other
> devices might be able to implement, for those cases where we want other
> hardware to act like a netdevice, lets just let those devices pick and
> choose what to implement, and the interface we already have will
> communicate with applications appropriately.
>=20
> Regards
> Neil
>=20

Hi Neil,

First off, a note on the naming and the basic concept: this proposal is not=
 trying to make everything look like NIC, rather we are trying to make a bu=
nch of different components appear as generic sources/sinks for pkts or mbu=
fs. From my point of view, it's an important difference.

Be that as it may, I'd like to first deal with the whole idea of the applic=
ation needing to know about the type of the underlying device. For me, this=
 is a critical point. Applications - such as all our sample apps - have ess=
entially two parts:
* an initialization and control part
* a data-path part.
These two parts are very, very different in what they do. The initializatio=
n part - which e.g. in testpmd continues on in the form of the cmdline inte=
rface as a control part - does the initial setup of devices/rings/etc. and =
potentially makes use of the full APIs provided by the ethdev interface. It=
's also not performance critical, as evidenced by the fact that the APIs us=
ed there have additional checks for valid input etc. The second, data-path =
part, is entirely the part that this proposal is targeting. This data-path =
is completely separate in the application, is highly performance sensitive,=
 and rarely, if ever needs to know or care about the actual source of its d=
ata. So the idea behind this library is that you can write your initializat=
ion control parts of your app as-now, fully aware of the underlying types i=
nvolved, and without ever using rx/tx burst. Then when you have the various=
 devices and DPDK objects set up, you spawn your data-path threads and pass=
 each one the set of input and outputs it needs, in the form of your generi=
c packet source objects.

This distinction is also why I'm not particularly interested in the ability=
 to pass in different objects via cmdline, as is done now with pcap/ring PM=
Ds. That's ok when you want the initialization part of your app to be obliv=
ious to see everything under a common abstraction, but when it's only the d=
ata path you want to work with generic packet objects that's unnecessary, a=
nd the initialization path should be able to convert any of the required in=
put/output sources to a generic type using a single API call. [This doesn't=
 rule out specifying different inputs/outputs on the commandline, it's just=
 you can specify them as their native types, rather than hiding them under =
a common API at the control-path level].

As for what that abstraction should be. There are a number of issues I see =
with ethdev - as it is right now, as that common abstraction.

1. The use of port-ids. I think port ids are fine for numbering physical po=
rts, but I think pointers are better for passing around objects to be worke=
d on by the data path. What is more concerning [than my opinion on numbers =
vs pointers :-)] is the fact that we are limited to 256 port ids. Yes, that=
 can be changed, but the impacts are massive. To change the type, we would =
break the ABI for every single ethdev API, as well as likely other function=
s too. Furthermore, increasing the size of the port id would require a chan=
ge to the internals of the mbuf structure, which would lead to the ABI bein=
g broken for any function that uses mbufs. By adopting an API, such as prop=
osed, which uses pointers, we avoid the problem, as port ids would only app=
ly to ethdevs.

2. Simplicity. While you say that its fine for an ethdev not to implement a=
ll the functions in the ethdev API, to create a proper PMD like you are pro=
posing involves a good deal more work than using the proposed pktdev abstra=
ction. If it's to appear like a proper NIC to the control paths, as well as=
 the init paths - which seems to be what you imply - you really do need to =
implement additional functions like queue setup, and start and stop. While =
it's true that the library can return -ENOSUP on an unsupported function, I=
 don't believe any of our sample apps are set up to check for this on NIC s=
etup, and therefore I would hazard a guess that real-world customer apps ar=
en't set up to handle it either.

3. Performance for rings. While not applicable for all cases, the performan=
ce of the rings under an ethdev abstraction would not be the same as here. =
For example, when polling on an empty ring for packets, the current time ta=
ken by our ring rx/tx functions, is literally a few cycles (as tested by th=
e rings autotest). If these functions cannot be inlined, that cycle count g=
oes up to 3x what it is now. [I observed this previously when doing reworki=
ng of the rings code, and the code-size led to icc no longer doing inlining=
. In that case, the gcc code for empty polling was indeed 3 times faster th=
an the icc version. Adding forced inlining made things equal again]. This m=
etric of empty polling may seem trivial i.e. "if there are no packets, why =
does it matter how long it takes?", but is important in real-world cases wh=
ere you are pulling packets from multiple sources, and your application is =
only currently dealing with input on one of them. [Often tested to see how =
an application handles in a single-flow situation - an metric our customers=
 do look at]. Even in the non-empty situation, for smaller packet bursts, t=
he overhead of the function call may slow things down. [For larger bursts, =
e.g. 32, the effect should not be noticeable, I suspect].

The only other final point I'd make here is that what is proposed is not pr=
oscriptive - whatever a future API for handling other device types, such as=
 crypto devices, may look like can be decided separately from this pktdev i=
mplementation. Whether one chooses pktdev or ethdev as a common abstraction=
 layer type, the decision of whether or not a particular object type is all=
owed to be made look like that common type can be made entirely independent=
ly, and based upon whether or not such a type-conversion makes sense.

Regards,
/Bruce