From: "Trahe, Fiona"
To: Ahmed Mansour, "Verma, Shally", dev@dpdk.org
CC: "Athreya, Narayana Prasad", "Gupta, Ashish", "Sahu, Sunila", "De Lara Guarch, Pablo", "Challa, Mahipal", "Jain, Deepak K", Hemant Agrawal, Roy Pledge, Youri Querry, "Trahe, Fiona"
Date: Wed, 21 Feb 2018 14:35:27 +0000
Subject: Re: [dpdk-dev] [RFC v2] doc compression API for DPDK

Hi Ahmed, Shally,

> -----Original Message-----
> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> Sent: Tuesday, February 20, 2018 7:56 PM
> To: Verma, Shally; Trahe, Fiona; dev@dpdk.org
> Cc: Athreya, Narayana Prasad; Gupta, Ashish; Sahu, Sunila; De Lara Guarch, Pablo; Challa, Mahipal; Jain, Deepak K; Hemant Agrawal; Roy Pledge; Youri Querry
> Subject: Re: [RFC v2] doc compression API for DPDK
>
> /// snip ///
> >>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
> >>>>>>>>>>>>>>> ---------------------------------------------------------------
> >>>>>>>>>>>>>>> It is always an ideal expectation from the application that it should parse
> >>>>>>>>>>>>>>> through all related chunks of source data, making its mbuf-chain, and enqueue
> >>>>>>>>>>>>>>> it for stateless processing.
> >>>>>>>>>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then an
> >>>>>>>>>>>>>>> expected call flow would be something like:
> >>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
> >>>>>>>>>>>>>> burst in a loop until all ops are received. Is this correct?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
> >>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However
> >>>>>>>>>>>>> this illustration is specifically in the context of stateful op processing
> >>>>>>>>>>>>> to reflect that if a stream is broken into chunks, then each chunk should be
> >>>>>>>>>>>>> submitted as one op at-a-time with type = STATEFUL and needs to be
> >>>>>>>>>>>>> dequeued first before the next chunk is enqueued.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
> >>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
> >>>>>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
> >>>>>>>>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish
> >>>>>>>>>>>>>> the response in exception cases?
> >>>>>>>>>>>>> [Shally] Multiple ops are allowed in flight, however the condition is that each
> >>>>>>>>>>>>> op in such a case is independent of the others, i.e. they belong to different
> >>>>>>>>>>>>> streams altogether.
> >>>>>>>>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all
> >>>>>>>>>>>>> related chunks of data in a single burst by passing them as an ops array, but
> >>>>>>>>>>>>> later found that not-so-useful for PMD handling for various reasons. You may
> >>>>>>>>>>>>> please refer to the RFC v1 doc review comments for the same.
> >>>>>>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
> >>>>>>>>>>>> time, since each needs the state of the previous, to allow more than 1 op to
> >>>>>>>>>>>> be in-flight at a time would force PMDs to implement internal queueing and
> >>>>>>>>>>>> exception handling for the OUT_OF_SPACE conditions you mention.
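[Fiona] To make the baseline concrete, here's a minimal sketch of the per-stream
stateful flow described above. The names (op fields, burst calls, flush flags)
follow this RFC draft and are not final:

    /* Sketch only: one stateful stream processed chunk-by-chunk.
     * All names below follow the draft proposal and may change. */
    for (i = 0; i < nb_chunks; i++) {
        op->m_src = chunk_mbuf[i];
        op->flush_flag = (i == nb_chunks - 1) ? RTE_COMP_FLUSH_FULL
                                              : RTE_COMP_FLUSH_NONE;
        while (rte_compressdev_enqueue_burst(dev_id, qp_id, &op, 1) == 0)
            ; /* retry until the single op is accepted */
        while (rte_compressdev_dequeue_burst(dev_id, qp_id, &op, 1) == 0)
            ; /* each op must complete before the next chunk is enqueued */
        if (op->status != RTE_COMP_OP_STATUS_SUCCESS)
            break; /* e.g. OUT_OF_SPACE, handled as discussed below */
    }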
> >>>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
> >>>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
> >>>>>>>>>> complex but doable.
> >>>>>>>>> [Fiona] In my opinion this is not doable, could be very inefficient.
> >>>>>>>>> There may be many streams.
> >>>>>>>>> The PMD would have to have an internal queue per stream so
> >>>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
> >>>>>>>>> And this may ripple back through all subsequent ops in the stream as each
> >>>>>>>>> source len is increased and its dst buffer is not big enough.
> >>>>>>>> [Ahmed] Regarding multi-op OUT_OF_SPACE handling.
> >>>>>>>> The caller would still need to adjust
> >>>>>>>> the src length/output buffer as you say. The PMD cannot handle
> >>>>>>>> OUT_OF_SPACE internally.
> >>>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
> >>>>>>>> until it gets explicit confirmation from the caller to continue working on
> >>>>>>>> this stream. Any ops received by the PMD should be returned to the caller
> >>>>>>>> with status STREAM_PAUSED, since the caller did not explicitly acknowledge
> >>>>>>>> that it has solved the OUT_OF_SPACE issue.
> >>>>>>>> These semantics can be enabled by adding a new function to the API,
> >>>>>>>> perhaps stream_resume().
> >>>>>>>> This allows the caller to indicate that it acknowledges that it has seen
> >>>>>>>> the issue, and this op should be used to resolve the issue.
> >>>>>>>> Implementations that do not support this mode of use can push back
> >>>>>>>> immediately after one op is in flight. Implementations that support this
> >>>>>>>> use mode can allow many ops from the same session.
> >>>>>>>>
> >>>>>>> [Shally] Is it still in the context of having a single burst where all ops
> >>>>>>> belong to one stream? If yes, I would still say it would add an overhead to
> >>>>>>> PMDs, especially if it is expected to work closer to HW (which I think is
> >>>>>>> the case with a DPDK PMD).
> >>>>>>> Though your approach is doable, why can this all not be in a layer above
> >>>>>>> the PMD? i.e. a layer above the PMD can either pass one op at a time with
> >>>>>>> burst size = 1 OR can make a chained mbuf of input and output and pass
> >>>>>>> that as one op.
> >>>>>>> Is it just to ease the application's chained-mbuf burden, or do you see any
> >>>>>>> performance/use-case impacting aspect also?
> >>>>>>>
> >>>>>>> If it is in a context where each op belongs to a different stream in a
> >>>>>>> burst, then why do we need stream_pause and resume? It is an expectation
> >>>>>>> from the app to pass more output buffer with consumed + 1 from the next
> >>>>>>> call onwards, as it has already seen OUT_OF_SPACE.
> >>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
> >>>>> implementation rejects all ops that belong to a stream that has entered
> >>>>> "RECOVERABLE" state for one reason or another. The caller must
> >>>>> acknowledge explicitly that it has received news of the problem before
> >>>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
> >>>>> that implementing this functionality in the software layer above the PMD
> >>>>> is a bad idea since the latency reductions are lost.
> >>>> [Shally] Just reiterating, I rather meant the other way around, i.e. I see it
> >>>> easier to put all such complexity in a layer above the PMD.
> >>>>
> >>>>> This setup is useful in latency-sensitive applications where the latency
> >>>>> of buffering multiple ops into one op is significant. We found latency
> >>>>> makes a significant difference in search applications where the PMD
> >>>>> competes with software decompression.
> >>> [Fiona] I see, so when all goes well, you get best-case latency, but when
> >>> out-of-space occurs latency will probably be worse.
> >> [Ahmed] This is exactly right. This use mode assumes out-of-space is a
> >> rare occurrence. Recovering from it should take similar time to
> >> synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
> >> both sync and async use. The caller can fix up the op and send it back
> >> to the PMD to continue work just as would be done in sync. Nonetheless,
> >> the added complexity is not justifiable if out-of-space is very common,
> >> since the recoverable state will be the limiting factor that forces
> >> synchronicity.
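[Fiona] So the application-side recovery you describe would look roughly like
the below. Note stream_resume() is only a proposal at this point, and the op
fields are illustrative:

    /* Sketch: caller fixes up an op that returned
     * OUT_OF_SPACE_RECOVERABLE and resubmits it. */
    if (op->status == RTE_COMP_OP_STATUS_OUT_OF_SPACE_RECOVERABLE) {
        op->src.offset += op->consumed;  /* skip data already consumed */
        op->src.length -= op->consumed;
        op->m_dst = bigger_mbuf;         /* supply more output space */
        rte_comp_stream_resume(stream);  /* proposed API: ack the error so
                                          * the PMD exits RECOVERABLE state */
        rte_compressdev_enqueue_burst(dev_id, qp_id, &op, 1);
    }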
> >>>>>> [Fiona] I still have concerns with this and would not want to support it in
> >>>>>> our PMD.
> >>>>>> To make sure I understand: you want to send a burst of ops, with several from
> >>>>>> the same stream.
> >>>>>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
> >>>>>> subsequent ops in that stream.
> >>>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
> >>>>>> Or somehow drop them? How?
> >>>>>> While still processing ops from other streams.
> >>>>> [Ahmed] This is exactly correct. It should return them with
> >>>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
> >>>>>> As we want to offload each op to hardware with as little CPU processing as
> >>>>>> possible, we would not want to open up each op to see which stream it's
> >>>>>> attached to and make decisions to do per-stream storage, or drop it, or
> >>>>>> bypass hw and dequeue without processing.
> >>>>> [Ahmed] I think I might have missed your point here, but I will try to
> >>>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
> >>>>> to the PMD and the PMD should reject until stream_continue() is called.
> >>>>> The next op to be sent by the user will have a special marker in it to
> >>>>> inform the PMD to continue working on this stream. Alternatively the
> >>>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
> >>>>> the stream and its state, but like you say this adds additional CPU
> >>>>> overhead during the enqueue.
> >>>>> I am curious. In a simple synchronous use case, how do we prevent users
> >>>>> from putting multiple ops in flight that belong to a single stream? Do
> >>>>> we just currently say it is undefined behavior? Otherwise we would have
> >>>>> to check the stream and incur the CPU overhead.
> >>> [Fiona] We don't do anything to prevent it. It's undefined. IMO on the data
> >>> path in the DPDK model we expect good behaviour and don't have to error check
> >>> for things like this.
> >> [Ahmed] This makes sense. We also assume good behavior.
> >>> In our PMD, if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then
> >>> build and send those messages. If we found an op from a stream which already
> >>> had one inflight, we'd have to hold that back, store it in a sw stream-specific
> >>> holding queue, and only send 19 to hw. We cannot send multiple ops from the same
> >>> stream to the hw as it fans them out and does them in parallel.
> >>> Once the enqueue_burst() returns, there is no processing
> >>> context which would spot that the first has completed
> >>> and send the next op to the hw. On a dequeue_burst() we would spot this,
> >>> and in that context could process the next op in the stream.
> >>> On out of space, instead of processing the next op we would have to transfer
> >>> all unprocessed ops from the stream to the dequeue result.
> >>> Some parts of this are doable, but it seems likely to add a lot more latency;
> >>> we'd need to add extra threads and timers to move ops from the sw
> >>> queue to the hw q to get any benefit, and these constructs would add
> >>> context switching and CPU cycles. So we prefer to push this responsibility
> >>> to above the API, which can achieve similar.
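[Fiona] i.e. the per-op check we'd be forced to add on our enqueue path would
be something like the below (helper names are hypothetical) - exactly the sort
of work we want to keep off the data path:

    /* Sketch of the extra enqueue-path work described above;
     * stream_has_op_inflight()/sw_hold_queue_push() are hypothetical. */
    uint16_t n_to_hw = 0;
    for (i = 0; i < nb_ops; i++) {
        if (stream_has_op_inflight(ops[i]->stream))
            sw_hold_queue_push(ops[i]->stream, ops[i]); /* hold back */
        else
            hw_msgs[n_to_hw++] = ops[i]; /* normal fast path */
    }
    hw_queue_submit(hw_msgs, n_to_hw);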
> >> [Ahmed] I see what you mean. Our workflow is almost exactly the same
> >> with our hardware, but the fanning out is done by the hardware based on
> >> the stream, and ops that belong to the same stream are never allowed to
> >> go out of order. Otherwise the data would be corrupted. Likewise the
> >> hardware is responsible for checking the state of the stream and
> >> returning frames as NOT_PROCESSED to the software.
> >>>>>> Maybe we could add a capability if this behaviour is important for you?
> >>>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS?
> >>>>>> Our PMD would set this to 0, and expect no more than one op from a stateful
> >>>>>> stream to be in flight at any time.
> >>>>> [Ahmed] That makes sense. This way the different DPDK implementations do
> >>>>> not have to add extra checking for unsupported cases.
> >>>> [Shally] @ahmed, if I summarise your use-case, is this how you want the PMD
> >>>> to support it?
> >>>> - a burst *carries only one stream* and all ops are then assumed to belong to
> >>>> that stream? (please note, here a burst is not carrying more than one stream)
> >> [Ahmed] No. In this use case the caller sets up an op and enqueues a
> >> single op. Then before the response comes back from the PMD the caller
> >> enqueues a second op on the same stream.
> >>>> - PMD will submit one op at a time to HW?
> >> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the
> >> HW. I used DPDK to mean the software implementation that talks to the
> >> hardware.
> >> The software will submit all ops immediately. The hardware has to figure
> >> out what to do with the ops depending on what stream they belong to.
> >>>> - if processed successfully, push it back to the completion queue with
> >>>> status = SUCCESS. If it failed or ran into OUT_OF_SPACE, then push it to the
> >>>> completion queue with status = FAILURE/OUT_OF_SPACE_RECOVERABLE and the rest
> >>>> with status = NOT_PROCESSED, and return with enqueue count = total # of ops
> >>>> submitted originally with the burst?
> >> [Ahmed] This is exactly what I had in mind. All ops will be submitted to
> >> the HW. The HW will put all of them on the completion queue with the
> >> correct status exactly as you say.
> >>>> - app assumes all have been enqueued, so it goes and dequeues all ops
> >>>> - on seeing an op with OUT_OF_SPACE_RECOVERABLE, app resubmits a burst of ops
> >>>> with a call to the stream_continue/resume API, starting from the op which
> >>>> encountered OUT_OF_SPACE and the others as NOT_PROCESSED, with updated input
> >>>> and output buffers?
> >> [Ahmed] Correct, this is what we do today in our proprietary API.
> >>>> - repeat until *all* are dequeued with status = SUCCESS or *any* with
> >>>> status = FAILURE? If at any time a failure is seen, does the app start the
> >>>> whole processing all over again or just drop this burst?!
> >> [Ahmed] The app has the choice on how to proceed. If the issue is
> >> recoverable then the application can continue this stream from where it
> >> stopped. If the failure is unrecoverable then the application should
> >> first fix the problem and start from the beginning of the stream.
> >>>> If all of the above is true, then I think we should add another API such as
> >>>> rte_comp_enque_single_stream(), which will be functional under Feature Flag =
> >>>> ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS, or is SUPPORT_ENQUEUE_SINGLE_STREAM a
> >>>> better name?!
> >> [Ahmed] The main advantage in async use is lost if we force all related
> >> ops to be in the same burst. If we do that, then we might as well merge
> >> all the ops into one op. That would reduce the overhead.
> >> The use mode I am proposing is only useful in cases where the data
> >> becomes available after the first enqueue occurred. I want to allow the
> >> caller to enqueue the second set of data as soon as it is available,
> >> regardless of whether or not the HW has already started working on the
> >> first op inflight.
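[Fiona] To spell out the use mode you're proposing, something like this (sketch
only; it would be valid only on PMDs advertising the capability discussed below):

    /* Sketch: two chunks of one stream in flight at the same time. */
    op1->stream = stream;
    op1->flush_flag = RTE_COMP_FLUSH_NONE;
    rte_compressdev_enqueue_burst(dev_id, qp_id, &op1, 1);

    /* second chunk becomes available before op1 is dequeued */
    op2->stream = stream;
    op2->flush_flag = RTE_COMP_FLUSH_FULL;
    rte_compressdev_enqueue_burst(dev_id, qp_id, &op2, 1);

    /* on dequeue: if op1 hit OUT_OF_SPACE_RECOVERABLE, op2 comes back
     * NOT_PROCESSED and must be resubmitted after recovery */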
> > [Shally] @ahmed, Ok.. seems I missed a point here. So, confirm me on the
> > following:
> >
> > As per the current description in the doc, the expected stateful usage is:
> > enqueue (op1) --> dequeue(op1) --> enqueue(op2)
> >
> > but you're suggesting to allow an option to change it to
> >
> > enqueue(op1) --> enqueue(op2)
> >
> > i.e. multiple ops from the same stream can be put in-flight via subsequent
> > enqueue_burst() calls without waiting to dequeue previous ones, if the PMD
> > supports it. So, no change to the current definition of a burst. It will still
> > carry multiple streams, with each op belonging to a different stream?!
> [Ahmed] Correct. I guess a user could put two ops on the same burst that
> belong to the same stream. In that case it would be more efficient to
> merge the ops using scatter gather. Nonetheless, I would not add checks
> in my implementation to limit that use. The hardware does not perceive a
> difference between ops that came in one burst and ops that came in two
> different bursts. To the hardware they are all ops. What matters is
> which stream each op belongs to.
> > If yes, then it seems your HW can be set up for multiple streams, so it is
> > efficient for your case to support it in the DPDK PMD layer, but our hw doesn't
> > by default and needs SW to back it. Given that, I also suggest enabling it
> > under some feature flag.
> >
> > However, it looks like an add-on, and if it doesn't change the current
> > definition of a burst and the minimum expectation set on stateful processing
> > described in this document, then IMO you can propose this feature as an
> > incremental patch on the baseline version, in the absence of which the
> > application will exercise stateful processing as described here
> > (enq->deq->enq). Thoughts?
> [Ahmed] Makes sense. I was worried that there might be fundamental
> limitations to this mode of use in the API design. That is why I wanted
> to share this use mode with you guys and see if it can be accommodated
> using an incremental patch in the future.
> >>> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst
> >>> should carry only one stream, or how this makes a difference, as there can be
> >>> many enqueue_burst() calls done before a dequeue_burst().
> >>> Maybe you're thinking the enqueue_burst() would be a blocking call that would
> >>> not return until all the ops had been processed? This would turn it into a
> >>> synchronous call, which isn't the intent.
> >> [Ahmed] Agreed, a blocking or even a buffering software layer that baby-sits
> >> the hardware does not fundamentally change the parameters of the
> >> system as a whole. It just moves workflow management complexity down
> >> into the DPDK software layer. Rather, there are real latency and
> >> throughput advantages (because of caching) that I want to expose.

[Fiona] Ok, so I think we've agreed that this can be an option, as long as it's
not required of PMDs and is enabled under an explicit capability - named
something like ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS.
@Ahmed, we'll leave it up to you to define the details.
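From the application side it might be gated something like this (a sketch; the
flag name below is a placeholder until those details are defined):

    /* Sketch: application checks for the capability before pipelining
     * several stateful ops of one stream. Flag name is a placeholder. */
    struct rte_compressdev_info info;
    rte_compressdev_info_get(dev_id, &info);
    if (info.feature_flags & RTE_COMP_FF_ENQUEUE_MULTIPLE_STATEFUL_OPS) {
        /* may enqueue further ops of a stream before dequeuing */
    } else {
        /* default behaviour: enq -> deq -> enq, one op per stream */
    }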
What's necessary is API text to describe the expected behaviour on any error
conditions, the pause/resume API, whether an API is expected to clean up if
resume doesn't happen and if there's any time limit on this, etc.
But I wouldn't expect any changes to existing burst APIs, and all PMDs and
applications must be able to handle the default behaviour, i.e. with this
capability disabled.
Specifically, even if a PMD has this capability, if an application ignores it
and only sends one op at a time, then if the PMD returns
OUT_OF_SPACE_RECOVERABLE, the stream should not be in a paused state and the
PMD should not wait for a resume() to handle the next op sent for that stream.
Does that make sense?

> >> /// snip ///