From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id 1E7B55678 for ; Fri, 25 Nov 2016 12:00:57 +0100 (CET) Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga104.fm.intel.com with ESMTP; 25 Nov 2016 03:00:57 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.31,546,1473145200"; d="scan'208";a="905356927" Received: from bricha3-mobl3.ger.corp.intel.com ([10.237.221.64]) by orsmga003.jf.intel.com with SMTP; 25 Nov 2016 03:00:54 -0800 Received: by (sSMTP sendmail emulation); Fri, 25 Nov 2016 11:00:53 +0000 Date: Fri, 25 Nov 2016 11:00:53 +0000 From: Bruce Richardson To: Jerin Jacob Cc: Thomas Monjalon , dev@dpdk.org, harry.van.haaren@intel.com, hemant.agrawal@nxp.com, gage.eads@intel.com Message-ID: <20161125110053.GA149796@bricha3-MOBL3.ger.corp.intel.com> References: <1479447902-3700-1-git-send-email-jerin.jacob@caviumnetworks.com> <3691745.y1f1NvKTEv@xps13> <20161124015912.GA13508@svelivela-lt.caveonetworks.com> <1883454.103LptOkIX@xps13> <20161125002334.GA21048@svelivela-lt.caveonetworks.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161125002334.GA21048@svelivela-lt.caveonetworks.com> Organization: Intel Research and =?iso-8859-1?Q?De=ACvel?= =?iso-8859-1?Q?opment?= Ireland Ltd. User-Agent: Mutt/1.7.1 (2016-10-04) Subject: Re: [dpdk-dev] [PATCH 1/4] eventdev: introduce event driven programming model X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Nov 2016 11:00:58 -0000 On Fri, Nov 25, 2016 at 05:53:34AM +0530, Jerin Jacob wrote: > On Thu, Nov 24, 2016 at 04:35:56PM +0100, Thomas Monjalon wrote: > > 2016-11-24 07:29, Jerin Jacob: > > > On Wed, Nov 23, 2016 at 07:39:09PM +0100, Thomas Monjalon wrote: > > > > 2016-11-18 11:14, Jerin Jacob: > > > > > +Eventdev API - EXPERIMENTAL > > > > > +M: Jerin Jacob > > > > > +F: lib/librte_eventdev/ > > > > > > > > OK to mark it experimental. > > > > What is the plan to remove the experimental word? > > > > > > IMO, EXPERIMENTAL status can be changed when > > > - At least two event drivers available(Intel and Cavium are working on > > > SW and HW event drivers) > > > - Functional test applications are fine with at least two drivers > > > - Portable example application to showcase the features of the library > > > - eventdev integration with another dpdk subsystem such as ethdev > > > > > > Thoughts?. I am not sure the criteria used in cryptodev case. > > > > Sounds good. > > We will be more confident when drivers and tests will be implemented. > > > > I think the roadmap for the SW driver targets the release 17.05. > > Do you still plan 17.02 for this API and the Cavium driver? > > No. 17.02 too short for up-streaming the Cavium driver.However, I think API and > skeleton event driver can go in 17.02 if there are no objections. > > > > > > > > +#define EVENTDEV_NAME_SKELETON_PMD event_skeleton > > > > > +/**< Skeleton event device PMD name */ > > > > > > > > I do not understand this #define. > > > > > > Applications can explicitly request the a specific driver though driver > > > name. This will go as argument to rte_event_dev_get_dev_id(const char *name). > > > The reason for keeping this #define in rte_eventdev.h is that, > > > application needs to include only rte_eventdev.h not rte_eventdev_pmd.h. > > > > So each driver must register its name in the API? > > Is it really needed? > > Otherwise how application knows the name of the driver. > The similar scheme used in cryptodev. > http://dpdk.org/browse/dpdk/tree/lib/librte_cryptodev/rte_cryptodev.h#n53 > No strong opinion here. Open for suggestions. > I like having a name registered. I think we need a scheme where an app can find and use an implementation using a specific driver. > > > > > > > +struct rte_event_dev_config { > > > > > + uint32_t dequeue_wait_ns; > > > > > + /**< rte_event_dequeue() wait for *dequeue_wait_ns* ns on this device. > > > > > > > > Please explain exactly when the wait occurs and why. > > > > > > Here is the explanation from rte_event_dequeue() API definition, > > > - > > > @param wait > > > 0 - no-wait, returns immediately if there is no event. > > > >0 - wait for the event, if the device is configured with > > > RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT then this function will wait until > > > the event available or *wait* time. > > > if the device is not configured with RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT > > > then this function will wait until the event available or *dequeue_wait_ns* > > > ^^^^^^^^^^^^^^^^^^^^^^ > > > ns which was previously supplied to rte_event_dev_configure() > > > - > > > This is provides the application to have control over, how long the > > > implementation should wait if event is not available. > > > > > > Let me know what exact changes are required if details are not enough in > > > rte_event_dequeue() API definition. > > > > Maybe that timeout would be a better name. > > It waits only if there is nothing in the queue. > > It can be interesting to highlight in this comment that this parameter > > makes the dequeue function a blocking call. > > OK. I will change to timeout then > > > > > > > > +/** Event port configuration structure */ > > > > > +struct rte_event_port_conf { > > > > > + int32_t new_event_threshold; > > > > > + /**< A backpressure threshold for new event enqueues on this port. > > > > > + * Use for *closed system* event dev where event capacity is limited, > > > > > + * and cannot exceed the capacity of the event dev. > > > > > + * Configuring ports with different thresholds can make higher priority > > > > > + * traffic less likely to be backpressured. > > > > > + * For example, a port used to inject NIC Rx packets into the event dev > > > > > + * can have a lower threshold so as not to overwhelm the device, > > > > > + * while ports used for worker pools can have a higher threshold. > > > > > + * This value cannot exceed the *nb_events_limit* > > > > > + * which previously supplied to rte_event_dev_configure() > > > > > + */ > > > > > + uint8_t dequeue_depth; > > > > > + /**< Configure number of bulk dequeues for this event port. > > > > > + * This value cannot exceed the *nb_event_port_dequeue_depth* > > > > > + * which previously supplied to rte_event_dev_configure() > > > > > + */ > > > > > + uint8_t enqueue_depth; > > > > > + /**< Configure number of bulk enqueues for this event port. > > > > > + * This value cannot exceed the *nb_event_port_enqueue_depth* > > > > > + * which previously supplied to rte_event_dev_configure() > > > > > + */ > > > > > +}; > > > > > > > > The depth configuration is not clear to me. > > > > > > Basically the maximum number of events can be enqueued/dequeued at time > > > from a given event port. depth of one == non burst mode. > > > > OK so depth is the queue size. Please could you reword? > > OK > > > > > > > > +/* Event types to classify the event source */ > > > > > > > > Why this classification is needed? > > > > > > This for application pipeling and the cases like, if application wants to know which > > > subsystem generated the event. > > > > > > example packet forwarding loop on the worker cores: > > > while(1) { > > > ev = dequeue() > > > // event from ethdev subsystem > > > if (ev.event_type == RTE_EVENT_TYPE_ETHDEV) { > > > - swap the mac address > > > - push to atomic queue for ingress flow order maintenance > > > by CORE > > > /* events from core */ > > > } else if (ev.event_type == RTE_EVENT_TYPE_CORE) { > > > > > > } > > > enqueue(ev); > > > } > > > > I don't know why but I feel this classification is weak. > > You need to track the source of the event. Does it make sense to go beyond > > and identify the source device? > > No, dequeue has dev_id argument, so event comes only from that device > > > > > > > > +#define RTE_EVENT_TYPE_ETHDEV 0x0 > > > > > +/**< The event generated from ethdev subsystem */ > > > > > +#define RTE_EVENT_TYPE_CRYPTODEV 0x1 > > > > > +/**< The event generated from crypodev subsystem */ > > > > > +#define RTE_EVENT_TYPE_TIMERDEV 0x2 > > > > > +/**< The event generated from timerdev subsystem */ > > > > > +#define RTE_EVENT_TYPE_CORE 0x3 > > > > > +/**< The event generated from core. > > > > > > > > What is core? > > > > > > The event are generated by lcore for pipeling. Any suggestion for > > > better name? lcore? > > > > What about CPU or SW? > > No strong opinion here. I will go with CPU then If you have no strong opinion, I think I'd prefer SW to CPU, as the main difference to my mind is that this comes from another SW entity rather than a hardware block. > > > > > > > > + /**< Opaque event pointer */ > > > > > + struct rte_mbuf *mbuf; > > > > > + /**< mbuf pointer if dequeued event is associated with mbuf */ > > > > > > > > How do we know that an event is associated with mbuf? > > > > > > By looking at the event source/type RTE_EVENT_TYPE_* > > > > > > > Does it mean that such events are always converted into mbuf even if the > > > > application does not need it? > > > > > > Hardware has dependency on getting physical address of the event, so any > > > struct that has "phys_addr_t buf_physaddr" works. > > > > I do not understand. > > In HW based implementations, the event pointer will be submitted to HW. > As you know, since HW can't understand the virtual address and it needs > to converted to the physical address, any DPDK object that provides phys_addr_t > such as mbuf can be used with libeventdev. > > > > > I tought that decoding the event would be the responsibility of the app > > by calling a function like > > rte_eventdev_convert_to_mbuf(struct rte_event *, struct rte_mbuf *). > > It can be. But it is costly.i.e Yet another function pointer based > driver interface on fastpath. Instead, if the driver itself can > convert to mbuf(in case of ETHDEV device) and tag the source/event type > as RTE_EVENT_TYPE_ETHDEV. > IMO the proposed schemed helps in SW based implementation as their no real > mbuf conversation. Something we can revisit in ethdev integration if > required. > > > > > > > > +struct rte_eventdev_driver; > > > > > +struct rte_eventdev_ops; > > > > > > > > I think it is better to split API and driver interface in two files. > > > > (we should do this split in ethdev) > > > > > > I thought so, but then the "static inline" versions of northbound > > > API(like rte_event_enqueue) will go another file(due to the fact that > > > implementation need to deference "dev->data->ports[port_id]"). Do you want that way? > > > I would like to keep all northbound API in rte_eventdev.h and not any of them > > > in rte_eventdev_pmd.h. > > > > My comment was confusing. > > You are doing 2 files, one for API (what you call northbound I think) > > and the other one for driver interface (what you call southbound I think), > > it's very fine. > > > > > > > +/** > > > > > + * Enqueue the event object supplied in the *rte_event* structure on an > > > > > + * event device designated by its *dev_id* through the event port specified by > > > > > + * *port_id*. The event object specifies the event queue on which this > > > > > + * event will be enqueued. > > > > > + * > > > > > + * @param dev_id > > > > > + * Event device identifier. > > > > > + * @param port_id > > > > > + * The identifier of the event port. > > > > > + * @param ev > > > > > + * Pointer to struct rte_event > > > > > + * > > > > > + * @return > > > > > + * - 0 on success > > > > > + * - <0 on failure. Failure can occur if the event port's output queue is > > > > > + * backpressured, for instance. > > > > > + */ > > > > > +static inline int > > > > > +rte_event_enqueue(uint8_t dev_id, uint8_t port_id, struct rte_event *ev) > > > > > > > > Is it really needed to have non-burst variant of enqueue/dequeue? > > > > > > Yes. certain HW can work only with non burst variants. > > > > Same comment as Bruce, we must keep only the burst variant. > > We cannot have different API for different HW. > > I don't think there is any portability issue here, I can explain. > > The application level, we have two more use case to deal with non burst > variant > > - latency critical work > - on dequeue, if application wants to deal with only one flow(i.e to > avoid processing two different application flows to avoid cache trashing) > > Selection of the burst variants will be based on > rte_event_dev_info_get() and rte_event_dev_configure()(see, max_event_port_dequeue_depth, > max_event_port_enqueue_depth, nb_event_port_dequeue_depth, nb_event_port_enqueue_depth ) > So I don't think their is portability issue here and I don't want to waste my > CPU cycles on the for loop if application known to be working with non > bursts variant like below > If the application is known to be working on non-burst varients, then they always request a burst-size of 1, and skip the loop completely. There is no extra performance hit in that case in either the app or the driver (since the non-burst driver always returns 1, irrespective of the number requested). > nb_events = rte_event_dequeue_burst(); > for(i=0; i < nb_events; i++){ > process ev[i] > } > > And mostly importantly the NPU can get almost same throughput > without burst variant so why not? > > > > > > > > +/** > > > > > + * Converts nanoseconds to *wait* value for rte_event_dequeue() > > > > > + * > > > > > + * If the device is configured with RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT flag then > > > > > + * application can use this function to convert wait value in nanoseconds to > > > > > + * implementations specific wait value supplied in rte_event_dequeue() > > > > > > > > Why is it implementation-specific? > > > > Why this conversion is not internal in the driver? > > > > > > This is for performance optimization, otherwise in drivers > > > need to convert ns to ticks in "fast path" > > > > So why not defining the unit of this timeout as CPU cycles like the ones > > returned by rte_get_timer_cycles()? > > Because HW co-processor can run in different clock domain. Need not be at > CPU frequency. > While I've no huge objection to this API, since it will not be implemented by our SW implementation, I'm just curious as to how much having this will save. How complicated is the arithmetic that needs to be done, and how many cycles on your platform is that going to take? /Bruce