DPDK patches and discussions
* [dpdk-dev] [RFC v2] doc compression API for DPDK
@ 2018-01-04 11:45 Verma, Shally
  2018-01-09 19:07 ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-01-04 11:45 UTC (permalink / raw)
  To: Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry, Ahmed Mansour

This is an RFC v2 document that summarizes the understanding of, and requirements for, the compression API proposal in DPDK. It is based on "[RFC v3] Compression API in DPDK" (http://dpdk.org/dev/patchwork/patch/32331/).
The intention of this document is to align on the concepts built into the compression API and its usage, and to identify further requirements.

Going forward, it could serve as the basis for a Compression Module Programmer's Guide.

The current scope is limited to:
- definition of the terminology that forms the foundation of the compression API
- the typical API flow expected to be used by applications
- definition and usage of stateless and stateful operations, following the RFC v1 doc review http://dev.dpdk.narkive.com/CHS5l01B/dpdk-dev-rfc-v1-doc-compression-api-for-dpdk
 
1. Overview
~~~~~~~~~~~

A. Compression Methodologies in compression API
===========================================
DPDK compression supports two compression methodologies:
- Stateless - each data object is compressed individually, without any reference to previous data.
- Stateful - each data object is compressed with reference to previous data objects, i.e. the history of the data is needed for compression / decompression.
For more explanation, please refer to RFC 1951: https://www.ietf.org/rfc/rfc1951.txt

To support both methodologies, DPDK compression introduces two key concepts: Session and Stream.

B. Notion of a session in compression API
================================== 
A session in DPDK compression is a logical entity that is set up once with immutable parameters, i.e. parameters that don't change across operations and devices.
A session can be shared across multiple devices and multiple operations simultaneously.
Typical session parameters include:
- compress / decompress
- the compression algorithm and its associated configuration parameters

An application can create different sessions on a device, initialized with the same or different xforms. Once a session is initialized with an xform, it cannot be re-initialized.
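A minimal sketch of the session rule above, assuming a session simply records its xform plus an "initialized" flag. All demo_* names are hypothetical stand-ins for illustration, not the proposed rte_comp / rte_compressdev API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the RFC's xform and session; not the DPDK API. */
enum demo_comp_action { DEMO_COMPRESS, DEMO_DECOMPRESS };

struct demo_comp_xform {
    enum demo_comp_action action; /* compress or decompress */
    int algo;                     /* algorithm identifier, e.g. DEFLATE */
    int level;                    /* algorithm-specific configuration */
};

struct demo_comp_session {
    bool initialized;
    struct demo_comp_xform xform; /* immutable once initialized */
};

/* Initialize a session once; any later re-initialization is rejected. */
static int demo_session_init(struct demo_comp_session *sess,
                             const struct demo_comp_xform *xform)
{
    assert(sess != NULL && xform != NULL);
    if (sess->initialized)
        return -1; /* already bound to an xform */
    sess->xform = *xform;
    sess->initialized = true;
    return 0;
}
```

Rejecting re-initialization, rather than overwriting, is what keeps a session safe to share across devices and in-flight ops: none of them can observe its parameters changing.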
 
C. Notion of stream in compression API
=======================================
Unlike a session, which carries a common set of information across operations, a stream in DPDK compression is a logical entity that identifies a related set of operations and carries the operation-specific information needed by the device during processing.
It is a device-specific data structure that is opaque to the application and is set up and maintained by the device.

A stream can be used with *only* one op at a time, i.e. no two operations can share the same stream simultaneously.
A stream is *mandatory* for stateful op processing and optional for stateless processing (see the respective sections for details).

This enables a session to be shared by multiple threads handling different data sets, since each op carries its own context (internal state, history buffers, etc.) in its attached stream.
The application should call rte_comp_stream_create() and attach the stream to an op before operation processing begins, and free it via rte_comp_stream_free() after processing is complete.
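The one-op-per-stream rule can be modeled as below. This assumes, purely for illustration, that a stream carries a busy flag that is set on attach and cleared once the op is dequeued; in the proposal the stream is opaque and PMD-managed, and the demo_* names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an opaque, PMD-owned stream; not the DPDK API. */
struct demo_stream {
    bool in_use;        /* set while an op holding this stream is in flight */
    size_t history_len; /* stands in for internal state / history buffers */
};

struct demo_op {
    struct demo_stream *stream;
};

/* Attach a stream to an op; refuse if another op already holds it. */
static int demo_op_attach_stream(struct demo_op *op, struct demo_stream *s)
{
    assert(op != NULL && s != NULL);
    if (s->in_use)
        return -1; /* a stream serves only one op at a time */
    s->in_use = true;
    op->stream = s;
    return 0;
}

/* Release the stream once the op has been dequeued. */
static void demo_op_detach_stream(struct demo_op *op)
{
    assert(op != NULL && op->stream != NULL);
    op->stream->in_use = false;
    op->stream = NULL;
}
```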

C. Notion of burst operations in compression API
=======================================
A burst in DPDK compression is an array of operations where each op carries an independent set of data, i.e. a burst can look like:

    ---------------------------------------------------------------------------------------------
    enqueue_burst(|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
    ---------------------------------------------------------------------------------------------

Here op1 .. op5 are all independent of each other and carry entirely different sets of data.
Each op can be attached to the same or a different session, but *must* be attached to a different stream.

Each op (struct rte_comp_op) carries compression/decompression operational parameters and is both an input and an output parameter.
The PMD gets source, destination and checksum information at input, and updates the op with bytes consumed, bytes produced and the checksum at output.

Since each operation in a burst is independent and can therefore complete out of order, applications that need ordering should set up a per-op user data area with reordering information, so that they can determine the enqueue order at dequeue.

Also, if multiple threads call enqueue_burst() on the same queue pair, it is the application's responsibility to use a proper locking mechanism to ensure exclusive enqueuing of operations.
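As a sketch of the reordering advice above, assuming the per-op user data area holds a 64-bit sequence number written at enqueue time (struct demo_op and the demo_* helpers are hypothetical, not rte_comp_op):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical op with a per-op user data area; not the DPDK rte_comp_op. */
struct demo_op {
    uint64_t seq; /* user data: enqueue order, written by the application */
};

/* Tag ops with their enqueue order before calling enqueue_burst(). */
static void demo_tag_ops(struct demo_op *ops, size_t n, uint64_t first_seq)
{
    for (size_t i = 0; i < n; i++)
        ops[i].seq = first_seq + i;
}

/* qsort comparator over an array of op pointers, ascending by seq. */
static int demo_cmp_seq(const void *a, const void *b)
{
    const struct demo_op *x = *(const struct demo_op *const *)a;
    const struct demo_op *y = *(const struct demo_op *const *)b;
    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Restore enqueue order for an array of dequeued op pointers. */
static void demo_reorder(struct demo_op **deq, size_t n)
{
    assert(deq != NULL || n == 0);
    qsort(deq, n, sizeof(*deq), demo_cmp_seq);
}
```

Sorting the dequeued pointers is simply the easiest choice for a sketch; an application with a bounded number of in-flight ops could instead place each op into a ring slot indexed by seq modulo the ring depth.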

D. Stateless Vs Stateful
===================
The compression API provides the RTE_COMP_FF_STATEFUL feature flag for a PMD to indicate its support for stateful operation. Each op carries an op type indicating whether it is to be processed stateful or stateless.
 
D.1 Compression API Stateless operation
------------------------------------------------------ 
An op is processed stateless if:
- its flush value is set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on the compression side),
- its op_type is set to RTE_COMP_OP_STATELESS, and
- it carries all of the required input and a sufficiently large output buffer to store the output, i.e. OUT_OF_SPACE can never occur.

When all of the above conditions are met, the PMD initiates stateless processing and releases acquired resources once processing of the current operation is complete, i.e. the full input is consumed and the full output is written.
The application can optionally attach a stream to such ops. In that case, the application must attach a different stream to each op.
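The stateless conditions above can be written as a single predicate. This is only a sketch: the enum values mirror the names used in this text, "max_out_len" is an assumed field standing for the application's knowledge of the maximum output size, and all demo_* names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical enums mirroring the names used in this RFC text. */
enum demo_op_type { DEMO_OP_STATELESS, DEMO_OP_STATEFUL };
enum demo_flush { DEMO_FLUSH_NONE, DEMO_FLUSH_SYNC, DEMO_FLUSH_FULL, DEMO_FLUSH_FINAL };

struct demo_op {
    enum demo_op_type op_type;
    enum demo_flush flush;
    bool is_compress;   /* the flush requirement applies to compression only */
    size_t dst_len;     /* output buffer capacity */
    size_t max_out_len; /* application-known upper bound on output size */
};

/* True if the op satisfies the stateless conditions listed above. */
static bool demo_is_stateless(const struct demo_op *op)
{
    assert(op != NULL);
    if (op->op_type != DEMO_OP_STATELESS)
        return false;
    if (op->is_compress &&
        op->flush != DEMO_FLUSH_FULL && op->flush != DEMO_FLUSH_FINAL)
        return false;
    /* The output must be large enough that OUT_OF_SPACE cannot occur. */
    return op->dst_len >= op->max_out_len;
}
```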

An application can enqueue stateless bursts by making consecutive enqueue_burst() calls, for example:

enqueued = rte_compressdev_enqueue_burst(dev_id, qp_id, ops1, nb_ops);
enqueued = rte_compressdev_enqueue_burst(dev_id, qp_id, ops2, nb_ops);

*Note: Every call uses a different ops array, i.e. the same rte_comp_op array *cannot be re-enqueued* to process the next batch of data until the previous ones are completely processed.

D.1.1 Stateless and OUT_OF_SPACE 
------------------------------------------------
OUT_OF_SPACE is a condition where the output buffer runs out of space while the PMD still has more data to produce. If the PMD runs into this condition during stateless processing, it is an error.
In that case, the PMD resets itself and returns with status RTE_COMP_OP_STATUS_OUT_OF_SPACE and produced = consumed = 0, i.e. no input read, no output written.
The application can resubmit the full input with a larger output buffer.
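A minimal sketch of the stateless OUT_OF_SPACE handling, assuming a mock PMD that "compresses" by copying (so the output is as large as the input) and an application that doubles the output buffer on each retry. The mock, the doubling policy and the demo_* names are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum demo_status { DEMO_STATUS_SUCCESS, DEMO_STATUS_OUT_OF_SPACE };

struct demo_op {
    const uint8_t *src; size_t src_len;
    uint8_t *dst; size_t dst_len;
    size_t consumed, produced;
    enum demo_status status;
};

/* Mock stateless PMD: "compresses" by copying. If the output does not fit,
 * it reports OUT_OF_SPACE with consumed = produced = 0. */
static void demo_pmd_stateless(struct demo_op *op)
{
    if (op->dst_len < op->src_len) {
        op->consumed = op->produced = 0;
        op->status = DEMO_STATUS_OUT_OF_SPACE;
        return;
    }
    memcpy(op->dst, op->src, op->src_len);
    op->consumed = op->produced = op->src_len;
    op->status = DEMO_STATUS_SUCCESS;
}

/* Application side: resubmit the full input with a doubled output buffer.
 * Returns the malloc'd output buffer (caller frees) or NULL on failure. */
static uint8_t *demo_stateless_with_retry(struct demo_op *op,
                                          size_t initial_len, int max_tries)
{
    assert(op != NULL);
    size_t len = initial_len ? initial_len : 1;
    for (int i = 0; i < max_tries; i++, len *= 2) {
        uint8_t *buf = malloc(len);
        if (buf == NULL)
            return NULL;
        op->dst = buf;
        op->dst_len = len;
        demo_pmd_stateless(op);
        if (op->status == DEMO_STATUS_SUCCESS)
            return buf;
        free(buf);      /* nothing was written; discard and grow */
        op->dst = NULL;
    }
    return NULL;
}
```

Because a stateless OUT_OF_SPACE reports consumed = produced = 0, the application can discard the old buffer and resubmit the same input unchanged.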

D.2 Compression API Stateful operation
----------------------------------------------------------
A stateful operation in DPDK compression means that the application invokes enqueue_burst() multiple times to process related chunks of data, either because:
- the application broke the data into several ops, and/or
- the PMD ran into an out-of-space situation during input processing.

In either or both of these cases, the PMD is required to maintain the state of the op across enqueue_burst() calls, and
ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with flush value RTE_COMP_NO/SYNC_FLUSH and ending with flush value RTE_COMP_FULL/FINAL_FLUSH.
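The flush sequence for a stateful stream can be planned up front as below, assuming the data is split into fixed-size chunks with NO_FLUSH on every chunk except the last, which uses FINAL_FLUSH. The chunking policy and the demo_* names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

enum demo_flush { DEMO_NO_FLUSH, DEMO_SYNC_FLUSH, DEMO_FULL_FLUSH, DEMO_FINAL_FLUSH };

struct demo_chunk {
    size_t offset, len;
    enum demo_flush flush;
};

/* Split `total` bytes into chunks for successive stateful enqueue calls:
 * every chunk but the last uses NO_FLUSH, the last uses FINAL_FLUSH.
 * Returns the number of chunks written (at most max_chunks). */
static size_t demo_plan_stateful(size_t total, size_t chunk,
                                 struct demo_chunk *out, size_t max_chunks)
{
    assert(chunk > 0);
    size_t n = 0;
    for (size_t off = 0; off < total && n < max_chunks; off += chunk, n++) {
        size_t len = total - off < chunk ? total - off : chunk;
        out[n].offset = off;
        out[n].len = len;
        out[n].flush = (off + len == total) ? DEMO_FINAL_FLUSH : DEMO_NO_FLUSH;
    }
    return n;
}
```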

D.2.1 Stateful operation state maintenance
---------------------------------------------------------------
Ideally, the application would gather all related chunks of source data into a single mbuf chain and enqueue it for stateless processing.
However, if it needs to break the data into several enqueue_burst() calls, the expected call flow would be something like:

enqueue_burst( |op.no_flush |)
dequeue_burst(op) // should dequeue before enqueuing the next chunk
enqueue_burst( |op.no_flush |)
dequeue_burst(op) // should dequeue before enqueuing the next chunk
enqueue_burst( |op.full_flush |)

Here the op *must* be attached to a stream, and every subsequent enqueue_burst() call should carry the *same* stream. Since the PMD maintains the op's state in the stream, the application must attach a stream to such ops.

D.2.2 Stateful and Out_of_Space
--------------------------------------------
If the PMD supports stateful operation and runs into an OUT_OF_SPACE situation, it is not an error condition. In that case, the PMD returns with status RTE_COMP_OP_STATUS_OUT_OF_SPACE, with consumed = the number of input bytes read and produced = the length of the complete output buffer (i.e. the output buffer has been filled).
The application should re-enqueue the op with the source starting at byte consumed+1 (offset consumed, counting from 0) and an output buffer with available space.
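A sketch of the stateful OUT_OF_SPACE resume loop, again using a hypothetical copy-based mock PMD and a fixed output window. It shows the offset arithmetic: after each OUT_OF_SPACE the source advances by consumed bytes, so with 0-based offsets the next byte read is at index consumed. The demo_* names are assumptions, and the stream that would carry the PMD's state is not modeled:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum demo_status { DEMO_STATUS_SUCCESS, DEMO_STATUS_OUT_OF_SPACE };

struct demo_op {
    const uint8_t *src; size_t src_len;
    uint8_t *dst; size_t dst_len;
    size_t consumed, produced;
    enum demo_status status;
};

/* Mock stateful PMD ("compression" = copy). When the output fills up it
 * returns OUT_OF_SPACE with the bytes consumed so far and produced = dst_len. */
static void demo_pmd_stateful(struct demo_op *op)
{
    size_t n = op->src_len < op->dst_len ? op->src_len : op->dst_len;
    memcpy(op->dst, op->src, n);
    op->consumed = n;
    op->produced = n;
    op->status = (n < op->src_len) ? DEMO_STATUS_OUT_OF_SPACE
                                   : DEMO_STATUS_SUCCESS;
}

/* Drive one input through a fixed-size output window, appending each
 * window's output to `out` (which must hold src_len bytes for this mock).
 * Returns the total number of bytes written to `out`. */
static size_t demo_stateful_drain(const uint8_t *src, size_t src_len,
                                  uint8_t *out, size_t window)
{
    assert(window > 0);
    size_t total = 0;
    struct demo_op op = { .src = src, .src_len = src_len };
    do {
        op.dst = out + total; /* fresh output space for this round */
        op.dst_len = window;
        demo_pmd_stateful(&op);
        total += op.produced;
        op.src += op.consumed; /* resume at offset `consumed` */
        op.src_len -= op.consumed;
    } while (op.status == DEMO_STATUS_OUT_OF_SPACE);
    return total;
}
```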
           
D.2.3 Sliding Window Size
------------------------------------
Every PMD will report, in its algorithm capability structure, the maximum length of the sliding window in bytes, which indicates the maximum history buffer length used by the algorithm.
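As an illustration of the capability check, assuming a capability record with a single max_window_bytes field (hypothetical; the RFC does not define its layout). DEFLATE back-references reach at most 32 KiB (RFC 1951), and zlib-style window bits map to bytes as 2^bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability record; the RFC only says each PMD reports the
 * maximum sliding-window length in bytes for an algorithm. */
struct demo_algo_capability {
    int algo;
    uint32_t max_window_bytes;
};

/* DEFLATE (RFC 1951) back-references reach at most 32 KiB. */
#define DEMO_DEFLATE_MAX_WINDOW (32u * 1024u)

/* zlib-style window bits (8..15) to a window size in bytes. */
static uint32_t demo_window_from_bits(unsigned bits)
{
    assert(bits >= 8 && bits <= 15);
    return 1u << bits;
}

/* Can data that needs this much history be handled by the PMD? */
static bool demo_window_supported(const struct demo_algo_capability *cap,
                                  uint32_t window_bytes)
{
    assert(cap != 0);
    return window_bytes <= cap->max_window_bytes;
}
```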

2. Example API illustration
~~~~~~~~~~~~~~~~~~~~~~~

The following illustrates API usage (this is just one flow; other variants are also possible):
1. rte_comp_session *sess = rte_compressdev_session_create(rte_mempool *pool);
2. rte_compressdev_session_init(int dev_id, rte_comp_session *sess, rte_comp_xform *xform, rte_mempool *sess_pool);
3. rte_comp_op_pool_create(rte_mempool ..)
4. rte_comp_op_bulk_alloc(struct rte_mempool *mempool, struct rte_comp_op **ops, uint16_t nb_ops);
5. for every rte_comp_op in ops[],
    5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session *sess);
    5.2 op->op_type = RTE_COMP_OP_STATELESS
    5.3 op->flush = RTE_FLUSH_FINAL
6. [Optional] for every rte_comp_op in ops[],
    6.1 rte_comp_stream_create(int dev_id, rte_comp_session *sess, void **stream);
    6.2 rte_comp_op_attach_stream(rte_comp_op *op, void *stream);
7. for every rte_comp_op in ops[],
    7.1 set up src/dst buffers
8. enq = rte_compressdev_enqueue_burst(dev_id, qp_id, ops, nb_ops);
9. dqu = 0; do while (dqu < enq) // wait until all enqueued ops are dequeued
    9.1 dqu += rte_compressdev_dequeue_burst(dev_id, qp_id, &ops[dqu], enq - dqu);
10. Repeat from step 7 for the next batch of data
11. [If streams were created in step 6] for every op in ops[],
    11.1 rte_comp_stream_free(op->stream);
12. rte_comp_session_clear(sess);
13. rte_comp_session_terminate(rte_comp_session *session)
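Steps 8 and 9 can be sketched with a mock queue pair that returns only a few completed ops per dequeue call. The point is that dequeue results are accumulated into &ops[dqu] instead of overwriting the count. The FIFO mock, its per_call limit and the demo_* names are hypothetical, not the rte_compressdev API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_op { int id; };

/* Hypothetical single queue pair: a FIFO that hands back at most `per_call`
 * completed ops per dequeue call, to mimic partial completion. */
#define DEMO_QP_DEPTH 64
struct demo_qp {
    struct demo_op *ring[DEMO_QP_DEPTH];
    uint16_t head, tail, per_call;
};

static uint16_t demo_enqueue_burst(struct demo_qp *qp, struct demo_op **ops,
                                   uint16_t nb_ops)
{
    uint16_t n = 0;
    while (n < nb_ops && (uint16_t)(qp->tail - qp->head) < DEMO_QP_DEPTH)
        qp->ring[qp->tail++ % DEMO_QP_DEPTH] = ops[n++];
    return n;
}

static uint16_t demo_dequeue_burst(struct demo_qp *qp, struct demo_op **ops,
                                   uint16_t nb_ops)
{
    uint16_t n = 0;
    while (n < nb_ops && n < qp->per_call && qp->head != qp->tail)
        ops[n++] = qp->ring[qp->head++ % DEMO_QP_DEPTH];
    return n;
}

/* Steps 8-9: enqueue, then accumulate dequeues into ops[dqu..] until every
 * enqueued op has come back. Spins if the device never completes an op. */
static uint16_t demo_process_batch(struct demo_qp *qp, struct demo_op **ops,
                                   uint16_t nb_ops)
{
    uint16_t enq = demo_enqueue_burst(qp, ops, nb_ops);
    uint16_t dqu = 0;
    while (dqu < enq)
        dqu += demo_dequeue_burst(qp, &ops[dqu], enq - dqu);
    return dqu;
}
```

A real application would typically do other work or back off between dequeue polls instead of spinning; waiting for every op before starting the next batch is just the choice made in this flow.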

Thanks
Shally



* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-01-04 11:45 [dpdk-dev] [RFC v2] doc compression API for DPDK Verma, Shally
@ 2018-01-09 19:07 ` Ahmed Mansour
  2018-01-10 12:55   ` Verma, Shally
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed Mansour @ 2018-01-09 19:07 UTC (permalink / raw)
  To: Verma, Shally, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

Hi Shally,

Thanks for the summary. It is very helpful. Please see my comments below.


On 1/4/2018 6:45 AM, Verma, Shally wrote:
> This is an RFC v2 document to brief understanding and requirements on compression API proposal in DPDK. It is based on "[RFC v3] Compression API in DPDK http://dpdk.org/dev/patchwork/patch/32331/ ".
> Intention of this document is to align on concepts built into compression API, its usage and identify further requirements. 
>
> Going further it could be a base to Compression Module Programmer Guide.
>
> Current scope is limited to
> - definition of the terminology which makes up foundation of compression API
> - typical API flow expected to use by applications
> - Stateless and Stateful operation definition and usage after RFC v1 doc review http://dev.dpdk.narkive.com/CHS5l01B/dpdk-dev-rfc-v1-doc-compression-api-for-dpdk
>  
> 1. Overview
> ~~~~~~~~~~~
>
> A. Compression Methodologies in compression API
> ===========================================
> DPDK compression supports two types of compression methodologies:
> - Stateless - each data object is compressed individually without any reference to previous data, 
> - Stateful -  each data object is compressed with reference to previous data object i.e. history of data is needed for compression / decompression
> For more explanation, please refer RFC https://www.ietf.org/rfc/rfc1951.txt
>
> To support both methodologies, DPDK compression introduces two key concepts: Session and Stream.
>
> B. Notion of a session in compression API
> ================================== 
> A Session in DPDK compression is a logical entity which is setup one-time with immutable parameters i.e. parameters that don't change across operations and devices.
> A session can be shared across multiple devices and multiple operations simultaneously. 
> A typical Session parameters includes info such as:
> - compress / decompress
> - compression algorithm and associated configuration parameters
>
> Application can create different sessions on a device initialized with same/different xforms. Once a session is initialized with one xform it cannot be re-initialized.
>  
> C. Notion of stream in compression API
>  =======================================
> Unlike session which carry common set of information across operations, a stream in DPDK compression is a logical entity which identify related set of operations and carry operation specific information as needed by device during its processing.
> It is device specific data structure which is opaque to application, setup and maintained by device. 
>
> A stream can be used with *only* one op at a time i.e. no two operations can share same stream simultaneously.
> A stream is *must* for stateful ops processing and optional for stateless (Please see respective sections for more details).
>
> This enables sharing of a session by multiple threads handling different data set as each op carry its own context (internal states, history buffers et el) in its attached stream. 
> Application should call rte_comp_stream_create() and attach to op before beginning of  operation processing and free via rte_comp_stream_free() after its complete.
>
> C. Notion of burst operations in compression API
>  =======================================
> A burst in DPDK compression is an array of operations where each op carry independent set of data. i.e. a burst can look like:
>
>                                       ---------------------------------------------------------------------------------------------------------
>               enque_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
>                                        ---------------------------------------------------------------------------------------------------------
>
> Where, op1 .. op5 are all independent of each other and carry entirely different set of data. 
> Each op can be attached to same/different session but *must* be attached to different stream.
>
> Each op (struct rte_comp_op) carry compression/decompression operational parameter and is both an input/output parameter. 
> PMD gets source, destination and checksum information at input and update it with bytes consumed and produced and checksum at output.
>
> Since each operation in a burst is independent and thus can complete out-of-order,  applications which need ordering, should setup per-op user data area with reordering information so that it can determine enqueue order at deque.
>
> Also if multiple threads calls enqueue_burst() on same queue pair then it’s application onus to use proper locking mechanism to ensure exclusive enqueuing of operations.
>
> D. Stateless Vs Stateful
> ===================
> Compression API provide RTE_COMP_FF_STATEFUL feature flag for PMD to reflect its support for Stateful operation. Each op carry an op type indicating if it's to be processed stateful or stateless.
>  
> D.1 Compression API Stateless operation
> ------------------------------------------------------ 
> An op is processed stateless if it has
> -              flush value is set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on compression side),
> -	 op_type set to RTE_COMP_OP_STATELESS
> -              All-of the required input and sufficient large output buffer to store output i.e. OUT_OF_SPACE can never occur.
>  
> When all of the above conditions are met, PMD initiates stateless processing and releases acquired resources after processing of current operation is complete i.e. full input consumed and full output written.
> Application can optionally attach a stream to such ops. In such case, application must attach different stream to each op.
>
> Application can enqueue stateless burst via making consecutive enque_burst() calls i.e. Following is relevant usage:
>  
> enqueued = rte_comp_enque_burst (dev_id, qp_id, ops1, nb_ops); 
> enqueued = rte_comp_enque_burst(dev_id, qp_id, ops2, nb_ops);  
>  
> *Note – Every call has different ops array i.e.  same rte_comp_op array *cannot be re-enqueued* to process next batch of data until previous ones are completely processed.
>
> D.1.1 Stateless and OUT_OF_SPACE 
> ------------------------------------------------
> OUT_OF_SPACE is a condition when output buffer runs out of space and where PMD still has more data to produce. If PMD run into such condition, then it's an error condition in stateless processing.
> In such case, PMD resets itself and return with status RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0 i.e. no input read, no output written.
> Application can resubmit an full input with larger output buffer size.

[Ahmed] Can we add an option to allow the user to read the data that was produced while still reporting OUT_OF_SPACE? This is mainly useful for decompression applications doing search.

> D.2 Compression API Stateful operation
> ----------------------------------------------------------
>  A Stateful operation in DPDK compression means application invokes enqueue burst() multiple times to process related chunk of data either because 
> - Application broke data into several ops, and/or
> - PMD ran into out_of_space situation during input processing
>
> In case of either one or all of the above conditions, PMD is required to maintain state of op across enque_burst() calls and
> ops are setup with op_type RTE_COMP_OP_STATEFUL, and begin with flush value = RTE_COMP_NO/SYNC_FLUSH and end at flush value RTE_COMP_FULL/FINAL_FLUSH.
>
> D.2.1 Stateful operation state maintenance
> ---------------------------------------------------------------
> It is always an ideal expectation from application that it should parse through all related chunk of source data making its mbuf-chain and enqueue it for stateless processing.
> However, if it need to break it into several enqueue_burst() calls, then an expected call flow would be something like:
>
> enqueue_burst( |op.no_flush |)

[Ahmed] The work is now in flight to the PMD. The user will call dequeue_burst in a loop until all ops are received. Is this correct?

> deque_burst(op) // should dequeue before we enqueue next
> enqueue_burst( |op.no_flush |)
> deque_burst(op) // should dequeue before we enqueue next
> enqueue_burst( |op.full_flush |)

[Ahmed] Why not allow multiple work items in flight? I understand that occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish the response in exception cases?

>
> Here an op *must* be attached to a stream and every subsequent enqueue_burst() call should carry *same* stream. Since PMD maintain ops state in stream, thus it is mandatory for application to attach stream to such ops.
>
> D.2.2 Stateful and Out_of_Space
> --------------------------------------------
> If PMD support stateful and run into OUT_OF_SPACE situation, then it is not an error condition for PMD. In such case, PMD return with status RTE_COMP_OP_STATUS_OUT_OF_SPACE with consumed = number of input bytes read and produced = length of complete output buffer.
> Application should enqueue op with source starting at consumed+1 and output buffer with available space.

[Ahmed] Related to OUT_OF_SPACE: what status does the user receive in a decompression case when the end block is encountered before the end of the input? Does the PMD continue decompression? Does it stop there and return the stop index?

>            
> D.2.3 Sliding Window Size
> ------------------------------------
> Every PMD will reflect in its algorithm capability structure maximum length of Sliding Window in bytes which would indicate maximum history buffer length used by algo.
>
> 2. Example API illustration
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> Following is an illustration on API usage  (This is just one flow, other variants are also possible):
> 1. rte_comp_session *sess = rte_compressdev_session_create (rte_mempool *pool);  
> 2. rte_compressdev_session_init (int dev_id, rte_comp_session *sess, rte_comp_xform *xform, rte_mempool *sess_pool);  
> 3. rte_comp_op_pool_create(rte_mempool ..)  
> 4. rte_comp_op_bulk_alloc (struct rte_mempool *mempool, struct rte_comp_op **ops, uint16_t nb_ops);  
> 5. for every rte_comp_op in ops[],
>     5.1 rte_comp_op_attach_session (rte_comp_op *op, rte_comp_session *sess); 
>     5.2 op.op_type = RTE_COMP_OP_STATELESS
>     5.3 op.flush = RTE_FLUSH_FINAL
> 6. [Optional] for every rte_comp_op in ops[],
>     6.1 rte_comp_stream_create(int dev_id, rte_comp_session *sess, void **stream); 
>     6.2 rte_comp_op_attach_stream(rte_comp_op *op, rte_comp_session *stream);

[Ahmed] What is the semantic effect of attaching a stream to every op? Will this application benefit from it, given that it is set up with op_type STATELESS?

> 7.for every rte_comp_op in ops[],
>      7.1 set up with src/dst buffer
> 8. enq = rte_compressdev_enqueue_burst (dev_id, qp_id, &ops, nb_ops); 
> 9. do while (dqu < enq) // Wait till all of enqueued are dequeued 
>     9.1 dqu = rte_compressdev_dequeue_burst (dev_id, qp_id, &ops, enq);

[Ahmed] I am assuming that waiting for all enqueued ops to be dequeued is not strictly necessary, but is just the example chosen in this case.

> 10. Repeat 7 for next batch of data  
> 11. for every ops in ops[]
>       11.1 rte_comp_stream_free(op->stream);
> 11. rte_comp_session_clear (sess) ;
> 12. rte_comp_session_terminate(ret_comp_sess *session)
>
> Thanks
> Shally
>
>



* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-01-09 19:07 ` Ahmed Mansour
@ 2018-01-10 12:55   ` Verma, Shally
  2018-01-11 18:53     ` Trahe, Fiona
  0 siblings, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-01-10 12:55 UTC (permalink / raw)
  To: Ahmed Mansour, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

Hi Ahmed,

> -----Original Message-----
> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> Sent: 10 January 2018 00:38
> To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona
> <fiona.trahe@intel.com>; dev@dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
> Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>;
> Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> Subject: Re: [RFC v2] doc compression API for DPDK
> 
> Hi Shally,
> 
> Thanks for the summary. It is very helpful. Please see comments below
> 
> 
> On 1/4/2018 6:45 AM, Verma, Shally wrote:
> > This is an RFC v2 document to brief understanding and requirements on compression API proposal in DPDK. It is based on "[RFC v3] Compression API in DPDK http://dpdk.org/dev/patchwork/patch/32331/ ".
> > Intention of this document is to align on concepts built into compression API, its usage and identify further requirements.
> >
> > Going further it could be a base to Compression Module Programmer Guide.
> >
> > Current scope is limited to
> > - definition of the terminology which makes up foundation of compression API
> > - typical API flow expected to use by applications
> > - Stateless and Stateful operation definition and usage after RFC v1 doc review http://dev.dpdk.narkive.com/CHS5l01B/dpdk-dev-rfc-v1-doc-compression-api-for-dpdk
> >
> > 1. Overview
> > ~~~~~~~~~~~
> >
> > A. Compression Methodologies in compression API
> > ===========================================
> > DPDK compression supports two types of compression methodologies:
> > - Stateless - each data object is compressed individually without any reference to previous data,
> > - Stateful -  each data object is compressed with reference to previous data object i.e. history of data is needed for compression / decompression
> > For more explanation, please refer RFC https://www.ietf.org/rfc/rfc1951.txt
> >
> > To support both methodologies, DPDK compression introduces two key concepts: Session and Stream.
> >
> > B. Notion of a session in compression API
> > ==================================
> > A Session in DPDK compression is a logical entity which is setup one-time with immutable parameters i.e. parameters that don't change across operations and devices.
> > A session can be shared across multiple devices and multiple operations simultaneously.
> > A typical Session parameters includes info such as:
> > - compress / decompress
> > - compression algorithm and associated configuration parameters
> >
> > Application can create different sessions on a device initialized with same/different xforms. Once a session is initialized with one xform it cannot be re-initialized.
> >
> > C. Notion of stream in compression API
> >  =======================================
> > Unlike session which carry common set of information across operations, a stream in DPDK compression is a logical entity which identify related set of operations and carry operation specific information as needed by device during its processing.
> > It is device specific data structure which is opaque to application, setup and maintained by device.
> >
> > A stream can be used with *only* one op at a time i.e. no two operations can share same stream simultaneously.
> > A stream is *must* for stateful ops processing and optional for stateless (Please see respective sections for more details).
> >
> > This enables sharing of a session by multiple threads handling different data set as each op carry its own context (internal states, history buffers et el) in its attached stream.
> > Application should call rte_comp_stream_create() and attach to op before beginning of  operation processing and free via rte_comp_stream_free() after its complete.
> >
> > C. Notion of burst operations in compression API
> >  =======================================
> > A burst in DPDK compression is an array of operations where each op carry independent set of data. i.e. a burst can look like:
> >
> >                                       ---------------------------------------------------------------------------------------------------------
> >               enque_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
> >                                        ---------------------------------------------------------------------------------------------------------
> >
> > Where, op1 .. op5 are all independent of each other and carry entirely different set of data.
> > Each op can be attached to same/different session but *must* be attached to different stream.
> >
> > Each op (struct rte_comp_op) carry compression/decompression operational parameter and is both an input/output parameter.
> > PMD gets source, destination and checksum information at input and update it with bytes consumed and produced and checksum at output.
> >
> > Since each operation in a burst is independent and thus can complete out-of-order,  applications which need ordering, should setup per-op user data area with reordering information so that it can determine enqueue order at deque.
> >
> > Also if multiple threads calls enqueue_burst() on same queue pair then it's application onus to use proper locking mechanism to ensure exclusive enqueuing of operations.
> >
> > D. Stateless Vs Stateful
> > ===================
> > Compression API provide RTE_COMP_FF_STATEFUL feature flag for PMD to reflect its support for Stateful operation. Each op carry an op type indicating if it's to be processed stateful or stateless.
> >
> > D.1 Compression API Stateless operation
> > ------------------------------------------------------
> > An op is processed stateless if it has
> > - flush value is set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on compression side),
> > - op_type set to RTE_COMP_OP_STATELESS
> > - All-of the required input and sufficient large output buffer to store output i.e. OUT_OF_SPACE can never occur.
> >
> > When all of the above conditions are met, PMD initiates stateless processing and releases acquired resources after processing of current operation is complete i.e. full input consumed and full output written.
> > Application can optionally attach a stream to such ops. In such case, application must attach different stream to each op.
> >
> > Application can enqueue stateless burst via making consecutive enque_burst() calls i.e. Following is relevant usage:
> >
> > enqueued = rte_comp_enque_burst (dev_id, qp_id, ops1, nb_ops);
> > enqueued = rte_comp_enque_burst(dev_id, qp_id, ops2, nb_ops);
> >
> > *Note - Every call has different ops array i.e.  same rte_comp_op array *cannot be re-enqueued* to process next batch of data until previous ones are completely processed.
> >
> > D.1.1 Stateless and OUT_OF_SPACE
> > ------------------------------------------------
> > OUT_OF_SPACE is the condition where the output buffer runs out of space while the PMD still has more data to produce. If the PMD runs into this condition, it is an error in stateless processing.
> > In that case, the PMD resets itself and returns status RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced = consumed = 0, i.e. no input read and no output written.
> > The application can resubmit the full input with a larger output buffer.
> 
> [Ahmed] Can we add an option to allow the user to read the data that was
> produced while still reporting OUT_OF_SPACE? This is mainly useful for
> decompression applications doing search.

[Shally] It is there, but it applies to the stateful operation type (please refer to the handling of OUT_OF_SPACE under the "Stateful" section).
By definition, "stateless" here means that the application (such as IPCOMP) knows the maximum output size with certainty and ensures that the uncompressed data cannot grow beyond the provided output buffer.
Such apps can submit an op with type = STATELESS and provide the full input; the PMD then assumes it has sufficient input and output, and thus doesn't need to maintain any context after the op is processed.
If the application doesn't know the maximum output size, it should process the data as a stateful op, i.e. set up the op with type = STATEFUL and attach a stream so that the PMD can maintain the relevant context to handle such a condition.
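A minimal sketch of this decision in plain C, with no DPDK dependency. The names `op_type`, `choose_op_type()` and `deflate_worst_case()` are hypothetical, not part of the proposed API. The worst-case formula is the one used by zlib's `compressBound()`; assuming a PMD's deflate output stays within a similar bound is itself an assumption. For decompression there is generally no such bound, so `max_out_known` would usually be false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the RFC's op types; not a released DPDK enum. */
enum op_type { OP_STATELESS, OP_STATEFUL };

/* Worst-case deflate output size, using the formula from zlib's
 * compressBound(). Assumption: the PMD's output obeys a similar bound. */
static size_t deflate_worst_case(size_t src_len)
{
    return src_len + (src_len >> 12) + (src_len >> 14) + (src_len >> 25) + 13;
}

/* Choose STATELESS only when the app can guarantee that dst holds the whole
 * output; otherwise fall back to STATEFUL so the PMD can keep context across
 * an OUT_OF_SPACE. max_out_known is typically false for decompression. */
static enum op_type choose_op_type(bool max_out_known, size_t max_out,
                                   size_t dst_len)
{
    if (max_out_known && max_out <= dst_len)
        return OP_STATELESS;
    return OP_STATEFUL;
}
```

For example, a 4096-byte input has a worst case of 4110 bytes under this formula, so a 4110-byte dst buffer qualifies for stateless processing and a 4096-byte one does not.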

> 
> > D.2 Compression API Stateful operation
> > ----------------------------------------------------------
> > A stateful operation in DPDK compression means the application invokes enqueue_burst() multiple times to process related chunks of data, either because
> > - the application broke the data into several ops, and/or
> > - the PMD ran into an out_of_space situation during input processing.
> >
> > In either or both of the above cases, the PMD is required to maintain the state of the op across enqueue_burst() calls, and ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with flush value RTE_COMP_NO/SYNC_FLUSH and ending with flush value RTE_COMP_FULL/FINAL_FLUSH.
> >
> > D.2.1 Stateful operation state maintenance
> > ---------------------------------------------------------------
> > Ideally, the application should gather all related chunks of source data into an mbuf chain and enqueue it for stateless processing.
> > However, if it needs to break the data into several enqueue_burst() calls, the expected call flow would be something like:
> >
> > enqueue_burst( |op.no_flush |)
> 
> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
> burst in a loop until all ops are received. Is this correct?
> 
> > deque_burst(op) // should dequeue before we enqueue next

[Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is specifically about stateful op processing: if a stream is broken into chunks, each chunk should be submitted as one op at a time with type = STATEFUL, and it must be dequeued before the next chunk is enqueued.
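The one-op-at-a-time rule for a stateful stream can be sketched as below. This is a self-contained mock, not the proposed API: `struct stream`, `struct op`, `mock_enqueue()` and `mock_dequeue()` are hypothetical stand-ins that only model two rules from the text, namely that a stream may have at most one op in flight and that only the last chunk carries the final flush. A real dequeue returns whichever ops completed rather than a specific op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the RFC's flush values, stream and op. */
enum flush { NO_FLUSH, FULL_FLUSH, FINAL_FLUSH };

struct stream {
    bool op_in_flight;      /* at most one op per stream may be in flight */
    size_t total_consumed;  /* input consumed across all ops of this stream */
};

struct op {
    struct stream *stream;
    enum flush flush;
    size_t src_len;
    size_t consumed;
};

/* Mock enqueue: rejects an op whose stream already has one in flight. */
static bool mock_enqueue(struct op *op)
{
    if (op->stream == NULL || op->stream->op_in_flight)
        return false;
    op->stream->op_in_flight = true;
    return true;
}

/* Mock dequeue: "completes" the in-flight op by consuming all its input. */
static bool mock_dequeue(struct op *op)
{
    if (!op->stream->op_in_flight)
        return false;
    op->consumed = op->src_len;
    op->stream->total_consumed += op->consumed;
    op->stream->op_in_flight = false;
    return true;
}

/* Submit the n chunks of one stream strictly one at a time: NO_FLUSH for
 * every chunk except the last, which carries FINAL_FLUSH. Returns the
 * number of chunks processed. */
static size_t process_stream(struct stream *s, const size_t *chunk_lens,
                             size_t n)
{
    size_t done = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        struct op op = {
            .stream = s,
            .flush = (i + 1 == n) ? FINAL_FLUSH : NO_FLUSH,
            .src_len = chunk_lens[i],
        };
        if (!mock_enqueue(&op))
            break;
        while (!mock_dequeue(&op))
            ;  /* poll until this chunk returns before enqueueing the next */
        done++;
    }
    return done;
}
```

Keeping the in-flight flag in the stream, rather than in a global, lets ops from different streams proceed independently while ops of the same stream stay serialized.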

> > enqueue_burst( |op.no_flush |)
> > deque_burst(op) // should dequeue before we enqueue next
> > enqueue_burst( |op.full_flush |)
> 
> [Ahmed] Why not allow multiple work items in flight? I understand that
> occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish
> the response in exception cases?

[Shally] Multiple ops are allowed in flight; however, the condition is that each op is then independent of the others, i.e. they belong to different streams altogether.
Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks of data in a single burst by passing them as an ops array, but later found it not so useful for PMD handling, for various reasons. Please refer to the RFC v1 doc review comments for details.
 
> >
> > Here an op *must* be attached to a stream, and every subsequent enqueue_burst() call should carry the *same* stream. Since the PMD maintains the op's state in the stream, it is mandatory for the application to attach a stream to such ops.
> >
> > D.2.2 Stateful and Out_of_Space
> > --------------------------------------------
> > If the PMD supports stateful operation and runs into an OUT_OF_SPACE situation, it is not an error condition for the PMD. In that case, the PMD returns status RTE_COMP_OP_STATUS_OUT_OF_SPACE with consumed = number of input bytes read and produced = length of the complete output buffer.
> > The application should enqueue the op with the source starting at consumed+1 and an output buffer with available space.
> 
> [Ahmed] Related to OUT_OF_SPACE: what status does the user receive in a
> decompression case when the end block is encountered before the end of
> the input? Does the PMD continue decomp? Does it stop there and return
> the stop index?
> 

[Shally] Before I answer this, please help me understand your use case. When you say "when the end block is encountered before the end of the input", do you mean:
"the decompressor processes a final block (i.e. one with BFINAL=1 in its header) and there is some footer data after it"? Or
do you mean "the decompressor processes one block and has more to process until its final block"?
What do "end block" and "end of input" refer to here?

> >
> > D.2.3 Sliding Window Size
> > ------------------------------------
> > Every PMD will report, in its algorithm capability structure, the maximum length of the sliding window in bytes, which indicates the maximum history buffer length used by the algorithm.
> >
> > 2. Example API illustration
> > ~~~~~~~~~~~~~~~~~~~~~~~
> >
> > Following is an illustration of API usage (this is just one flow; other variants are also possible):
> > 1. rte_comp_session *sess = rte_compressdev_session_create(rte_mempool *pool);
> > 2. rte_compressdev_session_init(int dev_id, rte_comp_session *sess, rte_comp_xform *xform, rte_mempool *sess_pool);
> > 3. rte_comp_op_pool_create(rte_mempool ..)
> > 4. rte_comp_op_bulk_alloc(struct rte_mempool *mempool, struct rte_comp_op **ops, uint16_t nb_ops);
> > 5. for every rte_comp_op in ops[],
> >     5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session *sess);
> >     5.2 op.op_type = RTE_COMP_OP_STATELESS
> >     5.3 op.flush = RTE_FLUSH_FINAL
> > 6. [Optional] for every rte_comp_op in ops[],
> >     6.1 rte_comp_stream_create(int dev_id, rte_comp_session *sess, void **stream);
> >     6.2 rte_comp_op_attach_stream(rte_comp_op *op, void *stream);
> 
> [Ahmed] What is the semantic effect of attaching a stream to every op? Will
> this application benefit from it, given that it is set up with op_type STATELESS?

[Shally] By role, a stream is a data structure that holds all the information the PMD needs to maintain while processing an op, and thus it is marked device-specific. It is required for stateful processing but optional for stateless, as the PMD doesn't need to maintain context once the op is processed, unlike stateful.
Using a stream for stateless ops may be advantageous for some PMDs: they can be designed to do one-time per-op setup (such as mapping session params) during stream_create() in the control path rather than in the data path.

> 
> > 7. for every rte_comp_op in ops[],
> >      7.1 set up with src/dst buffer
> > 8. enq = rte_compressdev_enqueue_burst(dev_id, qp_id, ops, nb_ops);
> > 9. dqu = 0; while (dqu < enq) // wait until all enqueued ops are dequeued
> >     9.1 dqu += rte_compressdev_dequeue_burst(dev_id, qp_id, &ops[dqu], enq - dqu);
> 
> [Ahmed] I am assuming that waiting for all enqueued ops to be dequeued is not
> strictly necessary, but is just the chosen example in this case.
> 

[Shally] Yes. By design, for burst_size > 1 each op is independent of the others, so the app may proceed as soon as it dequeues any.
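Because independent ops may complete out of order, an application that needs ordering can stamp each op with a sequence number in its per-op user data area, as the RFC suggests, and release results in order after dequeue. The following is a minimal sketch. It assumes a hypothetical `struct op` whose `seq` field stands in for that user data area, and that at most `WINDOW` ops are in flight. Neither name comes from the proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW 8  /* hypothetical maximum number of ops in flight */

/* Hypothetical op: seq stands in for the per-op user data area in which
 * the application records enqueue order. */
struct op {
    uint32_t seq;
};

struct reorder {
    struct op *slot[WINDOW];
    uint32_t next_seq;  /* sequence number to release next */
};

/* Store a dequeued op, which may arrive out of order. */
static void reorder_put(struct reorder *r, struct op *op)
{
    assert(op->seq - r->next_seq < WINDOW);  /* must fall inside the window */
    assert(r->slot[op->seq % WINDOW] == NULL);
    r->slot[op->seq % WINDOW] = op;
}

/* Release ops in enqueue order; returns how many were written to out[]. */
static size_t reorder_drain(struct reorder *r, struct op **out, size_t max)
{
    size_t n = 0;

    while (n < max) {
        struct op **s = &r->slot[r->next_seq % WINDOW];

        if (*s == NULL)
            break;  /* the next op in order has not completed yet */
        out[n++] = *s;
        *s = NULL;
        r->next_seq++;
    }
    return n;
}
```

If op 2 completes first, nothing is released. Once op 0 arrives, only op 0 is released, and ops 1 and 2 follow together when op 1 completes.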

> > 10. Repeat from step 7 for the next batch of data
> > 11. for every op in ops[]
> >       11.1 rte_comp_stream_free(op->stream);
> > 12. rte_comp_session_clear(sess);
> > 13. rte_comp_session_terminate(rte_comp_session *session)
> >
> > Thanks
> > Shally
> >
> >


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-01-10 12:55   ` Verma, Shally
@ 2018-01-11 18:53     ` Trahe, Fiona
  2018-01-12 13:49       ` Verma, Shally
  0 siblings, 1 reply; 30+ messages in thread
From: Trahe, Fiona @ 2018-01-11 18:53 UTC (permalink / raw)
  To: Verma, Shally, Ahmed Mansour, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry, Trahe, Fiona

Hi Shally, Ahmed,


> -----Original Message-----
> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
> Sent: Wednesday, January 10, 2018 12:55 PM
> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>; dev@dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> Subject: RE: [RFC v2] doc compression API for DPDK
> 
> Hi Ahmed
> 
> > -----Original Message-----
> > From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> > Sent: 10 January 2018 00:38
> > To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona
> > <fiona.trahe@intel.com>; dev@dpdk.org
> > Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
> > Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
> > <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
> > <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>;
> > Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> > <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> > Subject: Re: [RFC v2] doc compression API for DPDK
> >
> > Hi Shally,
> >
> > Thanks for the summary. It is very helpful. Please see comments below
> >
> >
> > On 1/4/2018 6:45 AM, Verma, Shally wrote:
> > > This is an RFC v2 document to summarize the understanding of, and requirements for, the compression API proposal in DPDK. It is based on "[RFC v3] Compression API in DPDK http://dpdk.org/dev/patchwork/patch/32331/ ".
> > > The intention of this document is to align on the concepts built into the compression API and its usage, and to identify further requirements.
> > >
> > > Going forward, it could serve as a basis for the Compression Module Programmer's Guide.
> > >
> > > Current scope is limited to
> > > - definition of the terminology which makes up the foundation of the compression API
> > > - typical API flow expected to be used by applications
> > > - Stateless and Stateful operation definition and usage after RFC v1 doc review http://dev.dpdk.narkive.com/CHS5l01B/dpdk-dev-rfc-v1-doc-compression-api-for-dpdk
> > >
> > > 1. Overview
> > > ~~~~~~~~~~~
> > >
> > > A. Compression Methodologies in compression API
> > > ===========================================
> > > DPDK compression supports two types of compression methodologies:
> > > - Stateless - each data object is compressed individually without any reference to previous data,
> > > - Stateful - each data object is compressed with reference to previous data objects, i.e. the history of data is needed for compression / decompression
> > > For more explanation, please refer to RFC 1951: https://www.ietf.org/rfc/rfc1951.txt
> > >
> > > To support both methodologies, DPDK compression introduces two key concepts: Session and Stream.
> > >
> > > B. Notion of a session in compression API
> > > ==================================
> > > A Session in DPDK compression is a logical entity which is set up one time with immutable parameters, i.e. parameters that don't change across operations and devices.
> > > A session can be shared across multiple devices and multiple operations simultaneously.
> > > Typical session parameters include info such as:
> > > - compress / decompress
> > > - compression algorithm and associated configuration parameters
> > >
> > > An application can create different sessions on a device, initialized with the same or different xforms. Once a session is initialized with one xform, it cannot be re-initialized.
> > >
> > > C. Notion of stream in compression API
> > >  =======================================
> > > Unlike a session, which carries a common set of information across operations, a stream in DPDK compression is a logical entity which identifies a related set of operations and carries operation-specific information as needed by the device during its processing.
> > > It is a device-specific data structure which is opaque to the application, and is set up and maintained by the device.
> > >
> > > A stream can be used with *only* one op at a time, i.e. no two operations can share the same stream simultaneously.
> > > A stream is a *must* for stateful op processing and optional for stateless (please see the respective sections for more details).
> > >
> > > This enables sharing of a session by multiple threads handling different data sets, as each op carries its own context (internal states, history buffers, etc.) in its attached stream.
> > > The application should call rte_comp_stream_create() and attach the stream to the op before beginning operation processing, and free it via rte_comp_stream_free() after processing is complete.
> > >
> > > C. Notion of burst operations in compression API
> > >  =======================================
> > > A burst in DPDK compression is an array of operations where each op carries an independent set of data, i.e. a burst can look like:
> > >
> > >     ------------------------------------------------------------------------------------------------
> > >     enqueue_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
> > >     ------------------------------------------------------------------------------------------------
> > >
> > > where op1 .. op5 are all independent of each other and carry entirely different sets of data.
> > > Each op can be attached to the same or a different session but *must* be attached to a different stream.
> > >
> > > Each op (struct rte_comp_op) carries compression/decompression operational parameters and is both an input and an output parameter.
> > > The PMD gets source, destination and checksum information at input, and updates it with bytes consumed, bytes produced and checksum at output.
> > >
> > > Since each operation in a burst is independent and thus can complete out of order, applications which need ordering should set up a per-op user data area with reordering information, so that they can determine the enqueue order at dequeue.
> > >
> > > Also, if multiple threads call enqueue_burst() on the same queue pair, it is the application's onus to use a proper locking mechanism to ensure exclusive enqueuing of operations.
> > >
> > > D. Stateless Vs Stateful
> > > ===================
> > > Compression API provides the RTE_COMP_FF_STATEFUL feature flag for a PMD to indicate its support for stateful operation. Each op carries an op type indicating whether it is to be processed statefully or statelessly.
> > >
> > > D.1 Compression API Stateless operation
> > > ------------------------------------------------------
> > > An op is processed statelessly if it has:
> > > - flush value set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on the compression side),
> > > - op_type set to RTE_COMP_OP_STATELESS,
> > > - all of the required input, and an output buffer large enough to store the output, i.e. OUT_OF_SPACE can never occur.
> > >
> > > When all of the above conditions are met, the PMD initiates stateless processing and releases acquired resources once processing of the current operation is complete, i.e. the full input is consumed and the full output is written.
[Fiona] I think the 3rd condition conflicts with D.1.1 below and in any case cannot be a precondition, i.e.
the PMD must initiate stateless processing based on RTE_COMP_OP_STATELESS.
It can't always know whether the output buffer is big enough before processing; it must process the input data, and
only when it has consumed it all can it know whether all the output data fits in the output buffer.

I'd suggest rewording as follows:
An op is processed statelessly if op_type is set to RTE_COMP_OP_STATELESS.
In this case
- The flush value must be set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on the compression side),
- All of the input data must be in the src buffer,
- The dst buffer should be large enough to hold the expected output.
The PMD acquires the necessary resources to process the op. After processing of the current operation is
complete, whether successful or not, it releases the acquired resources, and no state, history or data is
held in the PMD or carried over to subsequent ops.
In the SUCCESS case, the full input is consumed, the full output is written and status is set to RTE_COMP_OP_STATUS_SUCCESS.
OUT_OF_SPACE is handled as in D.1.1 below.
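Under this wording, an application handling a stateless OUT_OF_SPACE simply resubmits the whole input with a bigger buffer. A minimal sketch with a hypothetical mock PMD (`mock_stateless()`), where `out_len` stands for the size the output would have. The doubling policy and the `max_dst` cap are the application's own choices, not anything the RFC specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes mirroring the names used in the RFC. */
enum op_status { STATUS_SUCCESS, STATUS_OUT_OF_SPACE };

struct op_result {
    enum op_status status;
    size_t consumed;
    size_t produced;
};

/* Mock stateless PMD: out_len is the size the output would have. The op
 * either completes in full or fails with consumed = produced = 0. */
static struct op_result mock_stateless(size_t src_len, size_t dst_len,
                                       size_t out_len)
{
    struct op_result r = { STATUS_OUT_OF_SPACE, 0, 0 };

    if (dst_len >= out_len) {
        r.status = STATUS_SUCCESS;
        r.consumed = src_len;
        r.produced = out_len;
    }
    return r;
}

/* Resubmit the whole input with a doubled dst buffer until the op succeeds
 * or max_dst is exceeded. Returns the dst size that worked, or 0. */
static size_t stateless_with_retry(size_t src_len, size_t dst_len,
                                   size_t max_dst, size_t out_len)
{
    assert(dst_len > 0);  /* doubling 0 would never grow */
    while (dst_len <= max_dst) {
        struct op_result r = mock_stateless(src_len, dst_len, out_len);

        if (r.status == STATUS_SUCCESS)
            return dst_len;
        /* No state or data was kept, so the retry starts from scratch. */
        assert(r.consumed == 0 && r.produced == 0);
        dst_len *= 2;
    }
    return 0;
}
```

With a 3000-byte output and a 1024-byte starting buffer, the op fails at 1024 and 2048 and succeeds at 4096. Each failed attempt costs a full re-run of the input, which is why applications that cannot bound the output are steered towards stateful ops.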

> > > The application can optionally attach a stream to such ops. In that case, the application must attach a different stream to each op.
> > >
> > > The application can enqueue stateless bursts by making consecutive enqueue_burst() calls, e.g.:
> > >
> > > enqueued = rte_comp_enque_burst (dev_id, qp_id, ops1, nb_ops);
> > > enqueued = rte_comp_enque_burst(dev_id, qp_id, ops2, nb_ops);
> > >
> > > *Note* - Every call uses a different ops array, i.e. the same rte_comp_op array *cannot be re-enqueued* to process the next batch of data until the previous ops are completely processed.
> > >
> > > D.1.1 Stateless and OUT_OF_SPACE
> > > ------------------------------------------------
> > > OUT_OF_SPACE is the condition where the output buffer runs out of space while the PMD still has more data to produce. If the PMD runs into this condition, it is an error in stateless processing.
> > > In that case, the PMD resets itself and returns status RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced = consumed = 0, i.e. no input read and no output written.
> > > The application can resubmit the full input with a larger output buffer.
> >
> > [Ahmed] Can we add an option to allow the user to read the data that was
> > produced while still reporting OUT_OF_SPACE? This is mainly useful for
> > decompression applications doing search.
> 
> [Shally] It is there, but it applies to the stateful operation type (please refer to the handling of OUT_OF_SPACE under the "Stateful" section).
> By definition, "stateless" here means that the application (such as IPCOMP) knows the maximum output size with certainty and ensures that the uncompressed data cannot grow beyond the provided output buffer.
> Such apps can submit an op with type = STATELESS and provide the full input; the PMD then assumes it has sufficient input and output, and thus doesn't need to maintain any context after the op is processed.
> If the application doesn't know the maximum output size, it should process the data as a stateful op, i.e. set up the op with type = STATEFUL and attach a stream so that the PMD can maintain the relevant context to handle such a condition.
[Fiona] There may be an alternative that's useful for Ahmed, while still respecting the stateless concept.
In the stateless case where a PMD reports OUT_OF_SPACE in decompression,
it could also return consumed = 0, produced = x, where x > 0. x indicates the amount of valid data which has
been written to the output buffer. It is not complete, but if an application wants to search it, it may be sufficient.
If the application still wants the data, it must resubmit the whole input with a bigger output buffer, and
decompression will be repeated from the start; it
cannot expect to continue, as the PMD has not maintained state, history or data.
I don't think there would be any need to indicate this in capabilities: PMDs which cannot provide this
functionality would always return produced = consumed = 0, while PMDs which can could set produced > 0.
If this works for you both, we could consider a similar case for compression.
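A sketch of how an application could use this proposal, assuming a hypothetical `struct op_view` for a dequeued stateless decompression op. It searches only the first `produced` bytes, so it behaves the same with PMDs that always report produced = 0: the search just fails and the application falls back to resubmitting the whole input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes mirroring the names used in the RFC. */
enum op_status { STATUS_SUCCESS, STATUS_OUT_OF_SPACE };

/* Hypothetical view of a dequeued stateless decompression op. */
struct op_view {
    enum op_status status;
    size_t consumed;
    size_t produced;
    const unsigned char *dst;
};

/* Naive search for needle in the first len bytes of hay. */
static bool contains(const unsigned char *hay, size_t len, const char *needle)
{
    size_t n = strlen(needle);

    for (size_t i = 0; n <= len && i <= len - n; i++)
        if (memcmp(hay + i, needle, n) == 0)
            return true;
    return false;
}

/* Search whatever valid output the op reports. Under the proposal, a
 * stateless OUT_OF_SPACE keeps consumed = 0 but may report produced > 0. */
static bool search_output(const struct op_view *op, const char *needle)
{
    if (op->status == STATUS_OUT_OF_SPACE)
        assert(op->consumed == 0);  /* stateless: nothing to resume from */
    return contains(op->dst, op->produced, needle);
}
```

Because no capability flag is involved, the same application code works on both kinds of PMD. Only the hit rate on partial output differs.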

> 
> >
> > > D.2 Compression API Stateful operation
> > > ----------------------------------------------------------
> > > A stateful operation in DPDK compression means the application invokes enqueue_burst() multiple times to process related chunks of data, either because
> > > - the application broke the data into several ops, and/or
> > > - the PMD ran into an out_of_space situation during input processing.
> > >
> > > In either or both of the above cases, the PMD is required to maintain the state of the op across enqueue_burst() calls, and ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with flush value RTE_COMP_NO/SYNC_FLUSH and ending with flush value RTE_COMP_FULL/FINAL_FLUSH.
> > >
> > > D.2.1 Stateful operation state maintenance
> > > ---------------------------------------------------------------
> > > Ideally, the application should gather all related chunks of source data into an mbuf chain and enqueue it for stateless processing.
> > > However, if it needs to break the data into several enqueue_burst() calls, the expected call flow would be something like:
> > >
> > > enqueue_burst( |op.no_flush |)
> >
> > [Ahmed] The work is now in flight to the PMD. The user will call dequeue
> > burst in a loop until all ops are received. Is this correct?
> >
> > > deque_burst(op) // should dequeue before we enqueue next
> 
> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is specifically about stateful op processing: if a stream is broken into chunks, each chunk should be submitted as one op at a time with type = STATEFUL, and it must be dequeued before the next chunk is enqueued.
> 
> > > enqueue_burst( |op.no_flush |)
> > > deque_burst(op) // should dequeue before we enqueue next
> > > enqueue_burst( |op.full_flush |)
> >
> > [Ahmed] Why not allow multiple work items in flight? I understand that
> > occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish
> > the response in exception cases?
> 
> [Shally] Multiple ops are allowed in flight; however, the condition is that each op is then independent of the others, i.e. they belong to different streams altogether.
> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks of data in a single burst by passing them as an ops array, but later found it not so useful for PMD handling, for various reasons. Please refer to the RFC v1 doc review comments for details.
[Fiona] Agree with Shally. In summary, only one op of a stream can be processed at a time, since each needs the
state of the previous one. Allowing more than one op to be in flight at a time would
force PMDs to implement internal queueing and exception handling for the OUT_OF_SPACE conditions you mention.
If the application has all the data, it can put it into chained mbufs in a single op rather than
multiple ops, which avoids pushing all that complexity down to the PMDs.

> 
> > >
> > > Here an op *must* be attached to a stream, and every subsequent enqueue_burst() call should carry the *same* stream. Since the PMD maintains the op's state in the stream, it is mandatory for the application to attach a stream to such ops.
[Fiona] I think you're referring only to a single stream above, but as there may be many different streams,
maybe add the following?
The above is simplified to show just a single stream. However, there may be many streams, and each
enqueue_burst() may contain ops from different streams, as long as there is only one op in flight from any
stream at a given time.
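The rule in the suggested text (many streams per burst, but only one op in flight per stream) can be sketched as a burst builder. `struct pending_op`, `MAX_STREAMS` and `build_burst()` are hypothetical. A real application would also clear `in_flight[]` for a stream once that stream's op is dequeued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_STREAMS 64  /* hypothetical limit for this sketch */

/* Hypothetical pending op: only the fields this sketch needs. */
struct pending_op {
    unsigned stream_id;  /* index of the stream the op belongs to */
    bool submitted;
};

/* Build one burst from ops kept in arrival order, taking at most one op
 * per stream and skipping any stream that already has an op in flight.
 * A stream is marked busy as soon as its first pending op is picked, so
 * later ops of that stream cannot overtake it. */
static size_t build_burst(struct pending_op *ops, size_t n_ops,
                          bool in_flight[MAX_STREAMS],
                          struct pending_op **burst, size_t max_burst)
{
    size_t nb = 0;

    for (size_t i = 0; i < n_ops && nb < max_burst; i++) {
        struct pending_op *op = &ops[i];

        assert(op->stream_id < MAX_STREAMS);
        if (op->submitted || in_flight[op->stream_id])
            continue;
        in_flight[op->stream_id] = true;  /* cleared when the op is dequeued */
        op->submitted = true;
        burst[nb++] = op;
    }
    return nb;
}
```

With pending ops on streams 0, 1, 0 and 2, the first burst carries the ops for streams 0, 1 and 2, and the second op for stream 0 waits until the first one has been dequeued.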


> > >
> > > D.2.2 Stateful and Out_of_Space
> > > --------------------------------------------
> > > If the PMD supports stateful operation and runs into an OUT_OF_SPACE situation, it is not an error condition for the PMD. In that case, the PMD returns status RTE_COMP_OP_STATUS_OUT_OF_SPACE with consumed = number of input bytes read and produced = length of the complete output buffer.
[Fiona] produced would be <= the output buffer length (typically equal, but it could be a few bytes less).
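The two rules here can be sketched as a resubmission loop: produced can be slightly less than the buffer length, and the next op resumes right after the consumed input. Everything below is a hypothetical mock. To keep the arithmetic easy to follow, `mock_stateful()` pretends that each input byte expands to a fixed number of output bytes. No real deflate stream behaves this way.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes mirroring the names used in the RFC. */
enum op_status { STATUS_SUCCESS, STATUS_OUT_OF_SPACE };

struct op_result {
    enum op_status status;
    size_t consumed;
    size_t produced;
};

/* Mock stateful PMD in which each input byte expands to `ratio` output
 * bytes. On OUT_OF_SPACE it reports the input it read; produced may be a
 * few bytes short of dst_len. */
static struct op_result mock_stateful(size_t src_len, size_t dst_len,
                                      size_t ratio)
{
    struct op_result r;
    size_t fits = dst_len / ratio;  /* input bytes whose output fits in dst */

    if (fits >= src_len) {
        r.status = STATUS_SUCCESS;
        r.consumed = src_len;
    } else {
        r.status = STATUS_OUT_OF_SPACE;
        r.consumed = fits;
    }
    r.produced = r.consumed * ratio;
    return r;
}

/* Drive one stream to completion. After each OUT_OF_SPACE the next op
 * starts at 0-based offset off + consumed (the "consumed+1"-th byte in
 * 1-based terms), reusing the drained dst buffer. Returns the number of
 * ops submitted and stores the total output length in *total_out. */
static size_t run_stateful(size_t src_len, size_t dst_len, size_t ratio,
                           size_t *total_out)
{
    size_t off = 0, ops = 0;

    assert(ratio > 0 && dst_len >= ratio);  /* otherwise no progress */
    *total_out = 0;
    for (;;) {
        struct op_result r = mock_stateful(src_len - off, dst_len, ratio);

        ops++;
        assert(r.produced <= dst_len);
        off += r.consumed;
        *total_out += r.produced;
        if (r.status == STATUS_SUCCESS)
            return ops;
    }
}
```

Take 100 input bytes, a 64-byte dst and a ratio of 3. Each OUT_OF_SPACE op consumes 21 bytes and produces 63, one byte short of the buffer. The stream completes in five ops with 300 bytes of output.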


> > > The application should enqueue the op with the source starting at consumed+1 and an output buffer with available space.
> >
> > [Ahmed] Related to OUT_OF_SPACE: what status does the user receive in a
> > decompression case when the end block is encountered before the end of
> > the input? Does the PMD continue decomp? Does it stop there and return
> > the stop index?
> >
> 
> [Shally] Before I answer this, please help me understand your use case. When you say "when the end block is encountered before the end of the input", do you mean:
> "the decompressor processes a final block (i.e. one with BFINAL=1 in its header) and there is some footer data after it"? Or
> do you mean "the decompressor processes one block and has more to process until its final block"?
> What do "end block" and "end of input" refer to here?
> 
> > >
> > > D.2.3 Sliding Window Size
> > > ------------------------------------
> > > Every PMD will report, in its algorithm capability structure, the maximum length of the sliding window in bytes, which indicates the maximum history buffer length used by the algorithm.
> > >
> > > 2. Example API illustration
> > > ~~~~~~~~~~~~~~~~~~~~~~~
> > >
[Fiona] I think it would be useful to show an example of both a STATELESS flow and a STATEFUL flow.

> > > Following is an illustration of API usage (this is just one flow; other variants are also possible):
> > > 1. rte_comp_session *sess = rte_compressdev_session_create(rte_mempool *pool);
> > > 2. rte_compressdev_session_init(int dev_id, rte_comp_session *sess, rte_comp_xform *xform, rte_mempool *sess_pool);
> > > 3. rte_comp_op_pool_create(rte_mempool ..)
> > > 4. rte_comp_op_bulk_alloc(struct rte_mempool *mempool, struct rte_comp_op **ops, uint16_t nb_ops);
> > > 5. for every rte_comp_op in ops[],
> > >     5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session *sess);
> > >     5.2 op.op_type = RTE_COMP_OP_STATELESS
> > >     5.3 op.flush = RTE_FLUSH_FINAL
> > > 6. [Optional] for every rte_comp_op in ops[],
> > >     6.1 rte_comp_stream_create(int dev_id, rte_comp_session *sess, void **stream);
> > >     6.2 rte_comp_op_attach_stream(rte_comp_op *op, void *stream);
> > >
> > > [Ahmed] What is the semantic effect of attaching a stream to every op? Will
> > > this application benefit from it, given that it is set up with op_type STATELESS?
> >
> > [Ahmed] What is the semantic effect of attaching a stream to every op? will
> > this application benefit for this given that it is setup with op_type STATELESS
> 
> [Shally] By role, a stream is a data structure that holds all the information the PMD needs to maintain while processing an op, and thus it is marked device-specific. It is required for stateful processing but optional for stateless, as the PMD doesn't need to maintain context once the op is processed, unlike stateful.
> Using a stream for stateless ops may be advantageous for some PMDs: they can be designed to do one-time per-op setup (such as mapping session params) during stream_create() in the control path rather than in the data path.
> 
[Fiona] Yes, we agreed that stream_create() should be called for every session, and if it
returns non-NULL the PMD needs it, so op_attach_stream() must be called.
However, I've just realised we don't have a STATEFUL/STATELESS param on the xform, just on the op.
So we could either add a stateful/stateless param to stream_create(),
OR add a stateful/stateless param to the xform so it would be in the session?
However, Shally, can you reconsider whether you really need it for STATELESS, or whether the data you want to
store there could be stored in the session? Or, if it's needed per op, does it really need
to be visible on the API as a stream, or could it be hidden within the PMD?

> >
> > > 7. for every rte_comp_op in ops[],
> > >      7.1 set up with src/dst buffer
> > > 8. enq = rte_compressdev_enqueue_burst(dev_id, qp_id, ops, nb_ops);
> > > 9. dqu = 0; while (dqu < enq) // wait until all enqueued ops are dequeued
> > >     9.1 dqu += rte_compressdev_dequeue_burst(dev_id, qp_id, &ops[dqu], enq - dqu);
> >
> > [Ahmed] I am assuming that waiting for all enqueued ops to be dequeued is not
> > strictly necessary, but is just the chosen example in this case.
> >
> 
> [Shally] Yes. By design, for burst_size > 1 each op is independent of the others, so the app may proceed as soon as it dequeues any.
> 
> > > 10. Repeat from step 7 for the next batch of data
> > > 11. for every op in ops[]
> > >       11.1 rte_comp_stream_free(op->stream);
> > > 12. rte_comp_session_clear(sess);
> > > 13. rte_comp_session_terminate(rte_comp_session *session)
> > >
> > > Thanks
> > > Shally
> > >
> > >


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-01-11 18:53     ` Trahe, Fiona
@ 2018-01-12 13:49       ` Verma, Shally
  2018-01-25 18:19         ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-01-12 13:49 UTC (permalink / raw)
  To: Trahe, Fiona, Ahmed Mansour, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

Hi Fiona

> -----Original Message-----
> From: Trahe, Fiona [mailto:fiona.trahe@intel.com]
> Sent: 12 January 2018 00:24
> To: Verma, Shally <Shally.Verma@cavium.com>; Ahmed Mansour
> <ahmed.mansour@nxp.com>; dev@dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
> Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>;
> Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>; Trahe,
> Fiona <fiona.trahe@intel.com>
> Subject: RE: [RFC v2] doc compression API for DPDK
> 
> Hi Shally, Ahmed,
> 
> 
> > -----Original Message-----
> > From: Verma, Shally [mailto:Shally.Verma@cavium.com]
> > Sent: Wednesday, January 10, 2018 12:55 PM
> > To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona
> <fiona.trahe@intel.com>; dev@dpdk.org
> > Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
> Gupta, Ashish
> > <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>;
> De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
> <Mahipal.Challa@cavium.com>; Jain, Deepak K
> > <deepak.k.jain@intel.com>; Hemant Agrawal
> <hemant.agrawal@nxp.com>; Roy Pledge
> > <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> > Subject: RE: [RFC v2] doc compression API for DPDK
> >
> > Hi Ahmed
> >
> > > -----Original Message-----
> > > From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> > > Sent: 10 January 2018 00:38
> > > To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona
> > > <fiona.trahe@intel.com>; dev@dpdk.org
> > > Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
> > > Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
> > > <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
> > > <Mahipal.Challa@cavium.com>; Jain, Deepak K
> <deepak.k.jain@intel.com>;
> > > Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> > > <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> > > Subject: Re: [RFC v2] doc compression API for DPDK
> > >
> > > Hi Shally,
> > >
> > > Thanks for the summary. It is very helpful. Please see comments below
> > >
> > >
> > > On 1/4/2018 6:45 AM, Verma, Shally wrote:
> > > > This is an RFC v2 document to summarize our understanding of, and the requirements for, the compression API proposal in DPDK. It is based on "[RFC v3] Compression API in DPDK http://dpdk.org/dev/patchwork/patch/32331/ ".
> > > > The intention of this document is to align on the concepts built into the compression API and its usage, and to identify further requirements.
> > > >
> > > > Going forward, it could serve as the basis for the Compression Module Programmer's Guide.
> > > >
> > > > Current scope is limited to:
> > > > - definition of the terminology which makes up the foundation of the compression API
> > > > - the typical API flow expected to be used by applications
> > > > - Stateless and Stateful operation definition and usage, following the RFC v1 doc review http://dev.dpdk.narkive.com/CHS5l01B/dpdk-dev-rfc-v1-doc-compression-api-for-dpdk
> > > >
> > > > 1. Overview
> > > > ~~~~~~~~~~~
> > > >
> > > > A. Compression Methodologies in compression API
> > > > ===========================================
> > > > DPDK compression supports two types of compression methodologies:
> > > > - Stateless - each data object is compressed individually without any reference to previous data,
> > > > - Stateful - each data object is compressed with reference to previous data objects, i.e. the history of the data is needed for compression / decompression.
> > > > For more explanation, please refer to RFC 1951 https://www.ietf.org/rfc/rfc1951.txt
> > > >
> > > > To support both methodologies, DPDK compression introduces two key concepts: Session and Stream.
> > > >
> > > > B. Notion of a session in compression API
> > > > ==================================
> > > > A Session in DPDK compression is a logical entity which is set up one time with immutable parameters, i.e. parameters that don't change across operations and devices.
> > > > A session can be shared across multiple devices and multiple operations simultaneously.
> > > > Typical session parameters include info such as:
> > > > - compress / decompress
> > > > - compression algorithm and associated configuration parameters
> > > >
> > > > An application can create different sessions on a device, initialized with the same or different xforms. Once a session is initialized with one xform, it cannot be re-initialized.
> > > >
> > > > C. Notion of stream in compression API
> > > >  =======================================
> > > > Unlike a session, which carries a common set of information across operations, a stream in DPDK compression is a logical entity which identifies a related set of operations and carries operation-specific information as needed by the device during its processing.
> > > > It is a device-specific data structure which is opaque to the application, and is set up and maintained by the device.
> > > >
> > > > A stream can be used with *only* one op at a time, i.e. no two operations can share the same stream simultaneously.
> > > > A stream is a *must* for stateful op processing and optional for stateless (please see the respective sections for more details).
> > > >
> > > > This enables sharing of a session by multiple threads handling different data sets, as each op carries its own context (internal states, history buffers et al.) in its attached stream.
> > > > The application should call rte_comp_stream_create() and attach the stream to an op before operation processing begins, and free it via rte_comp_stream_free() after processing is complete.
> > > >
> > > > C. Notion of burst operations in compression API
> > > >  =======================================
> > > > A burst in DPDK compression is an array of operations where each op carries an independent set of data, i.e. a burst can look like:
> > > >
> > > >     ----------------------------------------------------------------------------------------------
> > > >     enqueue_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
> > > >     ----------------------------------------------------------------------------------------------
> > > >
> > > > Where op1 .. op5 are all independent of each other and carry entirely different sets of data.
> > > > Each op can be attached to the same or a different session but *must* be attached to a different stream.
> > > >
> > > > Each op (struct rte_comp_op) carries the compression/decompression operational parameters and is both an input and an output parameter.
> > > > The PMD gets source, destination and checksum information at input and updates it with bytes consumed, bytes produced and checksum at output.
> > > >
> > > > Since each operation in a burst is independent and thus can complete out of order, applications which need ordering should set up a per-op user data area with reordering information so that they can determine the enqueue order at dequeue.
> > > >
> > > > Also, if multiple threads call enqueue_burst() on the same queue pair, then it is the application's onus to use a proper locking mechanism to ensure exclusive enqueuing of operations.
> > > >
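A minimal sketch of the per-op reordering described above, in plain C. The `mock_op` struct and its `user_seq` field are hypothetical stand-ins for an rte_comp_op and its user data area, not fields defined by this proposal; the out-of-order completion is simulated by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a compression op. user_seq models the per-op
 * user data area in which the application records enqueue order; it is
 * not a field defined by the proposed API. */
struct mock_op {
    uint32_t user_seq;
    int status;
};

/* qsort comparator: the array holds struct mock_op pointers. */
static int cmp_by_seq(const void *a, const void *b)
{
    const struct mock_op *x = *(const struct mock_op *const *)a;
    const struct mock_op *y = *(const struct mock_op *const *)b;
    return (x->user_seq > y->user_seq) - (x->user_seq < y->user_seq);
}

/* Before enqueue: record each op's position in the burst. */
static void tag_burst(struct mock_op **ops, size_t nb_ops, uint32_t first_seq)
{
    for (size_t i = 0; i < nb_ops; i++)
        ops[i]->user_seq = first_seq + (uint32_t)i;
}

/* After dequeue: restore enqueue order for ops returned in completion order. */
static void reorder_dequeued(struct mock_op **ops, size_t nb_ops)
{
    assert(ops != NULL || nb_ops == 0);
    if (nb_ops > 1)
        qsort(ops, nb_ops, sizeof(ops[0]), cmp_by_seq);
}
```

Sorting only restores order within one dequeued batch; an application that dequeues partial batches would more likely keep a reorder buffer indexed by sequence number.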
> > > > D. Stateless Vs Stateful
> > > > ===================
> > > > The compression API provides the RTE_COMP_FF_STATEFUL feature flag for a PMD to reflect its support for stateful operation. Each op carries an op type indicating whether it is to be processed stateful or stateless.
> > > >
> > > > D.1 Compression API Stateless operation
> > > > ------------------------------------------------------
> > > > An op is processed stateless if it has:
> > > > - flush value set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on the compression side),
> > > > - op_type set to RTE_COMP_OP_STATELESS,
> > > > - all of the required input, and an output buffer large enough to store the output, i.e. OUT_OF_SPACE can never occur.
> > > >
> > > > When all of the above conditions are met, the PMD initiates stateless processing and releases acquired resources after processing of the current operation is complete, i.e. full input consumed and full output written.
> [Fiona] I think the 3rd condition conflicts with D1.1 below and anyway cannot be a precondition, i.e. the PMD must initiate stateless processing based on RTE_COMP_OP_STATELESS.
> It can't always know whether the output buffer is big enough before processing; it must process the input data, and only when it has consumed it all can it know whether all the output data fits in the output buffer.
>
> I'd suggest rewording as follows:
> An op is processed statelessly if op_type is set to RTE_COMP_OP_STATELESS.
> In this case
> - The flush value must be set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on the compression side),
> - All of the input data must be in the src buffer,
> - The dst buffer should be large enough to hold the expected output.
> The PMD acquires the necessary resources to process the op. After processing of the current operation is complete, whether successful or not, it releases the acquired resources, and no state, history or data is held in the PMD or carried over to subsequent ops.
> In the SUCCESS case, the full input is consumed, the full output is written, and status is set to RTE_COMP_OP_STATUS_SUCCESS.
> OUT_OF_SPACE is handled as in D1.1 below.
> 

[Shally] Ok. Agreed.
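To illustrate the agreed stateless semantics, here is a self-contained C sketch in which a mock PMD keeps no state between ops and, on OUT_OF_SPACE, reports consumed = produced = 0 as in D.1.1, so the application resubmits the whole input with a doubled dst buffer. All names are hypothetical, and the "decompression" is a toy run-length decoder over (count, byte) pairs, chosen only because its output is larger than its input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum mock_status { MOCK_SUCCESS, MOCK_OUT_OF_SPACE };

/* Hypothetical op: src/dst plus the consumed/produced/status outputs. */
struct mock_stateless_op {
    const uint8_t *src;
    size_t src_len;
    uint8_t *dst;
    size_t dst_len;
    size_t consumed;
    size_t produced;
    enum mock_status status;
};

/* Mock stateless PMD: expands (count, byte) pairs. It keeps nothing between
 * calls and, on OUT_OF_SPACE, reports consumed = produced = 0 (D.1.1). */
static void mock_process_stateless(struct mock_stateless_op *op)
{
    size_t i, out = 0;

    for (i = 0; i + 1 < op->src_len; i += 2) {
        size_t run = op->src[i];
        if (out + run > op->dst_len) {
            op->consumed = op->produced = 0;
            op->status = MOCK_OUT_OF_SPACE;
            return;
        }
        for (size_t k = 0; k < run; k++)
            op->dst[out++] = op->src[i + 1];
    }
    op->consumed = i;       /* whole pairs read */
    op->produced = out;
    op->status = MOCK_SUCCESS;
}

/* Application side: on OUT_OF_SPACE, resubmit the FULL input with a doubled
 * dst buffer, since no history was kept. Returns a malloc'd buffer holding
 * *out_len bytes, or NULL if max_cap is exceeded or allocation fails. */
static uint8_t *decompress_with_retry(const uint8_t *src, size_t src_len,
                                      size_t initial_cap, size_t max_cap,
                                      size_t *out_len)
{
    size_t cap = initial_cap;

    while (cap > 0 && cap <= max_cap) {
        uint8_t *dst = malloc(cap);
        if (dst == NULL)
            return NULL;

        struct mock_stateless_op op = { src, src_len, dst, cap, 0, 0, MOCK_SUCCESS };
        mock_process_stateless(&op);
        if (op.status == MOCK_SUCCESS) {
            assert(op.produced <= cap);
            *out_len = op.produced;
            return dst;
        }
        free(dst);
        if (cap > max_cap / 2)  /* doubling would exceed the limit */
            break;
        cap *= 2;
    }
    return NULL;
}
```

The bound `max_cap` reflects the D.1 expectation that a stateless application knows its maximum output size; without such a bound the op should be STATEFUL instead.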

> > > > An application can optionally attach a stream to such ops. In that case, the application must attach a different stream to each op.
> > > >
> > > > An application can enqueue stateless bursts by making consecutive enqueue_burst() calls, i.e. the following is relevant usage:
> > > >
> > > > enqueued = rte_compressdev_enqueue_burst(dev_id, qp_id, ops1, nb_ops);
> > > > enqueued = rte_compressdev_enqueue_burst(dev_id, qp_id, ops2, nb_ops);
> > > >
> > > > *Note - Every call has a different ops array, i.e. the same rte_comp_op array *cannot be re-enqueued* to process the next batch of data until the previous ones are completely processed.
> > > >
> > > > D.1.1 Stateless and OUT_OF_SPACE
> > > > ------------------------------------------------
> > > > OUT_OF_SPACE is a condition where the output buffer runs out of space while the PMD still has more data to produce. If the PMD runs into such a condition, it is an error condition in stateless processing.
> > > > In such a case, the PMD resets itself and returns with status RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0, i.e. no input read, no output written.
> > > > The application can resubmit the full input with a larger output buffer.
> > >
> > > [Ahmed] Can we add an option to allow the user to read the data that was produced while still reporting OUT_OF_SPACE? This is mainly useful for decompression applications doing search.
> >
> > [Shally] It is there, but applicable to the stateful operation type (please refer to handling out_of_space under the "Stateful" section).
> > By definition, "stateless" here means that the application (such as IPCOMP) knows the maximum output size with certainty and ensures that the uncompressed data size cannot grow beyond the provided output buffer.
> > Such apps can submit an op with type = STATELESS and provide the full input; the PMD then assumes it has sufficient input and output and thus doesn't need to maintain any context after the op is processed.
> > If the application doesn't know the max output size, then it should process it as a stateful op, i.e. set up the op with type = STATEFUL and attach a stream so that the PMD can maintain the relevant context to handle such a condition.
> [Fiona] There may be an alternative that's useful for Ahmed, while still respecting the stateless concept.
> In the stateless case where a PMD reports OUT_OF_SPACE in decompression, it could also return consumed=0, produced=x, where x>0. x indicates the amount of valid data which has been written to the output buffer. It is not complete, but if an application wants to search it, it may be sufficient.
> If the application still wants the data, it must resubmit the whole input with a bigger output buffer, and decompression will be repeated from the start; it cannot expect to continue on, as the PMD has not maintained state, history or data.
> I don't think there would be any need to indicate this in capabilities: PMDs which cannot provide this functionality would always return produced=consumed=0, while PMDs which can could set produced > 0.
> If this works for you both, we could consider a similar case for compression.
> 

[Shally] Sounds fine to me. Though in that case, consumed should also be updated to the number of bytes actually consumed by the PMD.
Setting consumed = 0 with produced > 0 doesn't correlate.
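A sketch of how an application might use the partial-output variant proposed above, assuming a PMD that, on stateless OUT_OF_SPACE, reports the valid bytes written (produced > 0) and the whole input pairs it actually consumed, per Shally's amendment. The mock decoder, status enum and search helper are hypothetical; the point is that the partial output can be searched, while a complete result still requires resubmitting the whole input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum mock_status { MOCK_SUCCESS, MOCK_OUT_OF_SPACE };

struct mock_result {
    size_t consumed;
    size_t produced;
    enum mock_status status;
};

/* Mock stateless decoder over (count, byte) pairs. On OUT_OF_SPACE it still
 * reports the valid bytes written (produced) and the whole pairs actually
 * consumed, instead of zeroing both. It keeps no state afterwards. */
static struct mock_result mock_decomp_partial(const uint8_t *src, size_t src_len,
                                              uint8_t *dst, size_t dst_len)
{
    struct mock_result r = { 0, 0, MOCK_SUCCESS };

    for (size_t i = 0; i + 1 < src_len; i += 2) {
        size_t run = src[i];
        if (r.produced + run > dst_len) {
            /* Fill what fits; this pair is only partly expanded. */
            while (r.produced < dst_len)
                dst[r.produced++] = src[i + 1];
            r.status = MOCK_OUT_OF_SPACE;
            return r;
        }
        if (run > 0)
            memset(dst + r.produced, src[i + 1], run);
        r.produced += run;
        r.consumed = i + 2;
    }
    return r;
}

/* Naive substring search over a byte buffer. */
static bool contains(const uint8_t *buf, size_t len, const uint8_t *pat, size_t pat_len)
{
    for (size_t i = 0; i + pat_len <= len; i++)
        if (memcmp(buf + i, pat, pat_len) == 0)
            return true;
    return false;
}

/* Search whatever output was produced, even if the op hit OUT_OF_SPACE.
 * A complete result would still need the whole input resubmitted. */
static bool search_stateless_output(const uint8_t *src, size_t src_len,
                                    uint8_t *dst, size_t dst_len,
                                    const uint8_t *pat, size_t pat_len,
                                    enum mock_status *status)
{
    struct mock_result r = mock_decomp_partial(src, src_len, dst, dst_len);

    assert(r.produced <= dst_len);
    *status = r.status;
    return contains(dst, r.produced, pat, pat_len);
}
```

A miss in the partial output is inconclusive when the status is OUT_OF_SPACE; only a hit, or a miss with SUCCESS, is a final answer.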

> >
> > >
> > > > D.2 Compression API Stateful operation
> > > > ----------------------------------------------------------
> > > > A Stateful operation in DPDK compression means the application invokes enqueue_burst() multiple times to process related chunks of data, either because
> > > > - the application broke the data into several ops, and/or
> > > > - the PMD ran into an out_of_space situation during input processing.
> > > >
> > > > In case of either or both of the above conditions, the PMD is required to maintain the state of the op across enqueue_burst() calls, and
> > > > ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with flush value = RTE_COMP_NO/SYNC_FLUSH and ending with flush value RTE_COMP_FULL/FINAL_FLUSH.
> > > >
> > > > D.2.1 Stateful operation state maintenance
> > > > ---------------------------------------------------------------
> > > > Ideally, the application is expected to parse through all related chunks of source data, build them into an mbuf chain, and enqueue it for stateless processing.
> > > > However, if it needs to break the data into several enqueue_burst() calls, then an expected call flow would be something like:
> > > >
> > > > enqueue_burst( |op.no_flush |)
> > >
> > > [Ahmed] The work is now in flight to the PMD. The user will call dequeue burst in a loop until all ops are received. Is this correct?
> > >
> > > > dequeue_burst(op) // should dequeue before we enqueue next
> >
> > [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is specifically in the context of stateful op processing, to show that if a stream is broken into chunks, each chunk should be submitted as one op at a time with type = STATEFUL and must be dequeued before the next chunk is enqueued.
> >
> > > > enqueue_burst( |op.no_flush |)
> > > > dequeue_burst(op) // should dequeue before we enqueue next
> > > > enqueue_burst( |op.full_flush |)
> > >
> > > [Ahmed] Why not allow multiple work items in flight? I understand that occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish the response in exception cases?
> >
> > [Shally] Multiple ops are allowed in flight; however, the condition is that each op in such a case is independent of the others, i.e. they belong to different streams altogether.
> > Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks of data in a single burst by passing them as an ops array, but later found it not so useful for PMD handling, for various reasons. Please refer to the RFC v1 doc review comments for the same.
> [Fiona] Agree with Shally. In summary, as only one op can be processed at a time, since each needs the state of the previous, allowing more than one op to be in flight at a time would force PMDs to implement internal queueing and exception handling for the OUT_OF_SPACE conditions you mention.
> If the application has all the data, it can put it into chained mbufs in a single op rather than multiple ops, which avoids pushing all that complexity down to the PMDs.
> 
> >
> > > >
> > > > Here an op *must* be attached to a stream, and every subsequent enqueue_burst() call should carry the *same* stream. Since the PMD maintains op state in the stream, it is mandatory for the application to attach a stream to such ops.
> [Fiona] I think you're referring only to a single stream above, but as there may be many different streams, maybe add the following?
> The above is simplified to show just a single stream. However, there may be many streams, and each enqueue_burst() may contain ops from different streams, as long as there is only one op in flight from any stream at a given time.
> 

[Shally] Ok, got it.
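A small C sketch of the rule in Fiona's note: a burst may mix ops from different streams, but each stream has at most one op in flight. The `app_stream` bookkeeping is hypothetical application state, not part of the proposed API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-stream bookkeeping kept by the application. */
struct app_stream {
    int id;
    bool op_in_flight;   /* set on enqueue, cleared on dequeue */
    size_t pending;      /* chunks still waiting to be submitted */
};

/* Pick at most one op per stream for the next enqueue_burst(), skipping
 * streams that already have an op in flight. Returns the burst size and
 * writes the chosen stream ids into burst_ids. */
static size_t build_stateful_burst(struct app_stream *s, size_t nb_streams,
                                   int *burst_ids, size_t max_burst)
{
    size_t n = 0;

    for (size_t i = 0; i < nb_streams && n < max_burst; i++) {
        if (s[i].op_in_flight || s[i].pending == 0)
            continue;
        s[i].op_in_flight = true;
        s[i].pending--;
        burst_ids[n++] = s[i].id;
    }
    return n;
}

/* Called for each dequeued op: its stream may now submit its next chunk. */
static void on_dequeued(struct app_stream *s, size_t nb_streams, int stream_id)
{
    for (size_t i = 0; i < nb_streams; i++) {
        if (s[i].id == stream_id) {
            assert(s[i].op_in_flight);
            s[i].op_in_flight = false;
            return;
        }
    }
}
```

Keeping this rule in the application is what lets the PMD avoid internal queueing per stream, as Fiona describes.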

> 
> > > >
> > > > D.2.2 Stateful and Out_of_Space
> > > > --------------------------------------------
> > > > If the PMD supports stateful and runs into an OUT_OF_SPACE situation, then it is not an error condition for the PMD. In such a case, the PMD returns with status RTE_COMP_OP_STATUS_OUT_OF_SPACE with consumed = number of input bytes read and produced = length of the complete output buffer.
> [Fiona] - produced would be <= output buffer len (typically =, but could be a few bytes less)
>
>
> > > > The application should enqueue the next op with the source starting at input byte consumed+1 (i.e. at offset consumed from the previous source start) and an output buffer with available space.
> > >
> > > [Ahmed] Related to OUT_OF_SPACE: what status does the user receive in a decompression case when the end block is encountered before the end of the input? Does the PMD continue decompression? Does it stop there and return the stop index?
> > >
> >
> > [Shally] Before I can answer this, please help me understand your use case. When you say "when the end block is encountered before the end of the input?", do you mean "the decompressor processes a final block (i.e. one with BFINAL=1 in its header) and there's some footer data after that?" Or do you mean "the decompressor processes one block and has more to process until its final block?"
> > What do "end block" and "end of input" refer to here?
> >
> > > >
> > > > D.2.3 Sliding Window Size
> > > > ------------------------------------
> > > > Every PMD will reflect in its algorithm capability structure the maximum length of the sliding window in bytes, which indicates the maximum history buffer length used by the algorithm.
> > > >
> > > > 2. Example API illustration
> > > > ~~~~~~~~~~~~~~~~~~~~~~~
> > > >
> [Fiona] I think it would be useful to show an example of both a STATELESS flow and a STATEFUL flow.
> 

[Shally] Ok. I can add a simplified version to illustrate API usage in both cases.
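As a rough illustration of the STATEFUL flow and the OUT_OF_SPACE handling in D.2.2, the C sketch below feeds chunks through a single mock stream, one op at a time, and on OUT_OF_SPACE resubmits with src advanced by `consumed` (the 0-based offset of input byte consumed+1). The mock stream and decoder are hypothetical: a toy run-length decoder whose (count, byte) pairs may straddle chunk boundaries, so state must carry over between ops. Flush values are not modelled, and the direct call to the mock stands in for an enqueue_burst()/dequeue_burst() pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum mock_status { MOCK_SUCCESS, MOCK_OUT_OF_SPACE };

/* Hypothetical per-stream state that survives across ops. */
struct mock_stream {
    int have_count;     /* a count byte was read, its value byte not yet */
    uint8_t count;
    size_t run_left;    /* bytes of the current run still to be written */
    uint8_t run_byte;
};

struct mock_stateful_op {
    struct mock_stream *stream;
    const uint8_t *src;
    size_t src_len;
    uint8_t *dst;
    size_t dst_len;
    size_t consumed;
    size_t produced;
    enum mock_status status;
};

/* Mock stateful PMD: decodes (count, byte) pairs, keeping partial pairs and
 * unfinished runs in the stream so the next op can continue. */
static void mock_process_stateful(struct mock_stateful_op *op)
{
    struct mock_stream *st = op->stream;
    size_t in = 0, out = 0;

    for (;;) {
        while (st->run_left > 0 && out < op->dst_len) {
            op->dst[out++] = st->run_byte;
            st->run_left--;
        }
        if (st->run_left > 0) {         /* dst full, more to produce */
            op->status = MOCK_OUT_OF_SPACE;
            break;
        }
        if (in == op->src_len) {        /* all input consumed */
            op->status = MOCK_SUCCESS;
            break;
        }
        uint8_t b = op->src[in++];
        if (!st->have_count) {
            st->count = b;
            st->have_count = 1;
        } else {
            st->run_byte = b;
            st->run_left = st->count;
            st->have_count = 0;
        }
    }
    op->consumed = in;
    op->produced = out;
}

/* Application side: one op in flight on the stream. On OUT_OF_SPACE, drain
 * dst and resubmit with src advanced by `consumed`. Returns bytes written to
 * out, or SIZE_MAX if out_cap is too small. */
static size_t decompress_chunks(const uint8_t *const *chunks, const size_t *chunk_lens,
                                size_t nb_chunks, uint8_t *dst, size_t dst_len,
                                uint8_t *out, size_t out_cap)
{
    struct mock_stream stream = { 0 };
    size_t total = 0;

    assert(dst_len > 0);
    for (size_t c = 0; c < nb_chunks; c++) {
        size_t off = 0;
        struct mock_stateful_op op;

        do {
            op = (struct mock_stateful_op){ &stream, chunks[c] + off, chunk_lens[c] - off,
                                            dst, dst_len, 0, 0, MOCK_SUCCESS };
            mock_process_stateful(&op);     /* stands in for enqueue + dequeue */
            if (op.produced > out_cap - total)
                return SIZE_MAX;
            memcpy(out + total, dst, op.produced);
            total += op.produced;
            off += op.consumed;
        } while (op.status == MOCK_OUT_OF_SPACE);
    }
    return total;
}
```

For example, the input {5, 'a', 2, 'b'} split into chunks {5} and {'a', 2, 'b'} with a 3-byte dst takes one op for the first chunk and three ops for the second, yielding "aaaaabb".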

> > > > Following is an illustration of API usage (this is just one flow; other variants are also possible):
> > > > 1. rte_comp_session *sess = rte_compressdev_session_create(rte_mempool *pool);
> > > > 2. rte_compressdev_session_init(int dev_id, rte_comp_session *sess, rte_comp_xform *xform, rte_mempool *sess_pool);
> > > > 3. rte_comp_op_pool_create(rte_mempool ..)
> > > > 4. rte_comp_op_bulk_alloc(struct rte_mempool *mempool, struct rte_comp_op **ops, uint16_t nb_ops);
> > > > 5. for every rte_comp_op in ops[],
> > > >     5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session *sess);
> > > >     5.2 op.op_type = RTE_COMP_OP_STATELESS
> > > >     5.3 op.flush = RTE_FLUSH_FINAL
> > > > 6. [Optional] for every rte_comp_op in ops[],
> > > >     6.1 rte_comp_stream_create(int dev_id, rte_comp_session *sess, void **stream);
> > > >     6.2 rte_comp_op_attach_stream(rte_comp_op *op, void *stream);
> > >
> > > [Ahmed] What is the semantic effect of attaching a stream to every op? Will this application benefit from it, given that it is set up with op_type STATELESS?
> >
> > [Shally] By role, a stream is a data structure that holds all the information the PMD needs to maintain for op processing, and thus it's marked device specific. It is required for stateful processing but optional for stateless, as the PMD doesn't need to maintain context once the op is processed, unlike stateful.
> > It may be an advantage for some PMDs to use a stream for stateless. They can be designed to do one-time per-op setup (such as mapping session params) during stream_create() in the control path rather than the data path.
> >
> [Fiona] yes, we agreed that stream_create() should be called for every session, and if it returns non-NULL the PMD needs it, so op_attach_stream() must be called.
> However, I've just realised we don't have a STATEFUL/STATELESS param on the xform, just on the op.
> So we could either add a stateful/stateless param to stream_create()
> OR add a stateful/stateless param to the xform so it would be in the session?

[Shally] No, it shouldn't be part of the session or xform, as sessions aren't stateless/stateful.
So we shouldn't alter the current definition of session or xforms.
If we need to mention it, then it could be added as part of stream_create(), as it's device specific, and depending upon op_type the device can then set up stream resources.

> However, Shally, can you reconsider whether you really need it for STATELESS, or whether the data you want to store there could be stored in the session? Or, if it's needed per op, does it really need to be visible on the API as a stream, or could it be hidden within the PMD?

[Shally] I would say it is not mandatory, but a desirable feature that I am suggesting.
I am only trying to enable an optimization in the data path which may help some PMD designs, as they can use stream_create() to do setup that is one-time per op regardless of op_type, such as, as I mentioned, mapping user session params to device session params.
We can hide it inside the PMD; however, there may be slight overhead in the data path depending on the PMD design.
But I would say it's not a blocker for us to freeze the current spec. We can revisit this feature later because it will not alter base API functionality.
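A sketch of the stream_create() rule discussed above, under the assumption (Shally's suggestion, not settled in this thread) that stream_create() takes the op type and may return NULL when a PMD has no use for a stream on stateless ops. All names are hypothetical mocks, not the proposed API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

enum mock_op_type { MOCK_OP_STATELESS, MOCK_OP_STATEFUL };

/* Hypothetical PMD trait: does it want a stream for stateless ops, e.g. to
 * map session params once in the control path? */
struct mock_pmd {
    bool wants_stateless_stream;
};

struct mock_op {
    enum mock_op_type type;
    void *stream;       /* opaque, device-specific context or NULL */
};

/* Mock stream_create() taking the op type, as suggested in this thread.
 * Returns NULL when the PMD has no use for a stream. */
static void *mock_stream_create(const struct mock_pmd *pmd, enum mock_op_type type)
{
    if (type == MOCK_OP_STATELESS && !pmd->wants_stateless_stream)
        return NULL;
    return calloc(1, 64);
}

/* Always call stream_create(); attach whatever it returns. A stream is
 * mandatory for stateful ops, so NULL there is an error. */
static int prepare_op(const struct mock_pmd *pmd, struct mock_op *op, enum mock_op_type type)
{
    assert(pmd != NULL && op != NULL);
    op->type = type;
    op->stream = mock_stream_create(pmd, type);
    if (type == MOCK_OP_STATEFUL && op->stream == NULL)
        return -1;
    return 0;
}

static void release_op(struct mock_op *op)
{
    free(op->stream);   /* free(NULL) is a no-op */
    op->stream = NULL;
}
```

With this shape the application code path is identical for both kinds of PMD, which matches the point that the feature would not alter base API functionality.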

Thanks
Shally

> 
> > >
> > > > 7. for every rte_comp_op in ops[],
> > > >      7.1 set up with src/dst buffer
> > > > 8. enq = rte_compressdev_enqueue_burst(dev_id, qp_id, ops, nb_ops);
> > > > 9. dqu = 0; do while (dqu < enq) // Wait till all enqueued ops are dequeued
> > > >     9.1 dqu += rte_compressdev_dequeue_burst(dev_id, qp_id, &ops[dqu], enq - dqu);
> > >
> > > [Ahmed] I am assuming that waiting for all enqueued ops to be dequeued is not strictly necessary, but is just the chosen example in this case.
> > >
> >
> > [Shally] Yes. By design, for burst_size > 1 each op is independent of the others, so the app may proceed as soon as it dequeues any.
> >
> > > > 10. Repeat from step 7 for the next batch of data
> > > > 11. for every op in ops[]
> > > >       11.1 rte_comp_stream_free(op->stream);
> > > > 12. rte_comp_session_clear(sess);
> > > > 13. rte_comp_session_terminate(rte_comp_session *session)
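Step 9 needs the dequeued count to be accumulated, with each call writing after the ops already collected. A self-contained C sketch of that loop, using a hypothetical one-shot mock queue pair in place of rte_compressdev_enqueue_burst()/rte_compressdev_dequeue_burst(), which returns at most `per_call` completed ops per dequeue:

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_QP_DEPTH 64

/* Hypothetical queue pair: a one-shot FIFO (enough for this example) that
 * returns at most per_call completed ops per dequeue, mimicking gradual
 * completion by a device. */
struct mock_qp {
    void *ring[MOCK_QP_DEPTH];
    uint16_t head, tail, per_call;
};

static uint16_t mock_enqueue_burst(struct mock_qp *qp, void **ops, uint16_t nb_ops)
{
    uint16_t n = 0;

    while (n < nb_ops && qp->tail < MOCK_QP_DEPTH)
        qp->ring[qp->tail++] = ops[n++];
    return n;
}

static uint16_t mock_dequeue_burst(struct mock_qp *qp, void **ops, uint16_t nb_ops)
{
    uint16_t n = 0;

    while (n < nb_ops && n < qp->per_call && qp->head < qp->tail)
        ops[n++] = qp->ring[qp->head++];
    return n;
}

/* Steps 8-9: accumulate dqu and dequeue into &done[dqu] so earlier results
 * are not overwritten. Assumes every enqueued op eventually completes. */
static uint16_t submit_and_drain(struct mock_qp *qp, void **ops, uint16_t nb_ops, void **done)
{
    uint16_t enq = mock_enqueue_burst(qp, ops, nb_ops);
    uint16_t dqu = 0;

    assert(qp->per_call > 0);
    while (dqu < enq)
        dqu += mock_dequeue_burst(qp, &done[dqu], (uint16_t)(enq - dqu));
    return dqu;
}
```

As Shally notes, waiting for all enqueued ops is only one choice; an application could process each dequeued op immediately instead of busy-waiting for the whole burst.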
> > > >
> > > > Thanks
> > > > Shally
> > > >
> > > >


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-01-12 13:49       ` Verma, Shally
@ 2018-01-25 18:19         ` Ahmed Mansour
  2018-01-29 12:47           ` Verma, Shally
  2018-01-31 19:03           ` Trahe, Fiona
  0 siblings, 2 replies; 30+ messages in thread
From: Ahmed Mansour @ 2018-01-25 18:19 UTC (permalink / raw)
  To: Verma, Shally, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

Hi All,

Sorry for the delay. Please see responses inline.

Ahmed

On 1/12/2018 8:50 AM, Verma, Shally wrote:
> Hi Fiona
>
>> -----Original Message-----
>> From: Trahe, Fiona [mailto:fiona.trahe@intel.com]
>> Sent: 12 January 2018 00:24
>> To: Verma, Shally <Shally.Verma@cavium.com>; Ahmed Mansour
>> <ahmed.mansour@nxp.com>; dev@dpdk.org
>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
>> Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>;
>> Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>; Trahe,
>> Fiona <fiona.trahe@intel.com>
>> Subject: RE: [RFC v2] doc compression API for DPDK
>>
>> Hi Shally, Ahmed,
>>
>>
>>> -----Original Message-----
>>> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
>>> Sent: Wednesday, January 10, 2018 12:55 PM
>>> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona
>>> <fiona.trahe@intel.com>; dev@dpdk.org
>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
>>> Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>;
>>> De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>;
>>> Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>> Subject: RE: [RFC v2] doc compression API for DPDK
>>>
>>> HI Ahmed
>>>
>>>> -----Original Message-----
>>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>>>> Sent: 10 January 2018 00:38
>>>> To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona
>>>> <fiona.trahe@intel.com>; dev@dpdk.org
>>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
>>>> Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
>>>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>>>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
>>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>;
>>>> Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>>>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>>
>>>> Hi Shally,
>>>>
>>>> Thanks for the summary. It is very helpful. Please see comments below.
>>>>
>>>>
>>>> On 1/4/2018 6:45 AM, Verma, Shally wrote:
>>>>> This is an RFC v2 document to summarize our understanding of, and the requirements for, the compression API proposal in DPDK. It is based on "[RFC v3] Compression API in DPDK http://dpdk.org/dev/patchwork/patch/32331/ ".
>>>>> The intention of this document is to align on the concepts built into the compression API and its usage, and to identify further requirements.
>>>>> Going forward, it could serve as the basis for the Compression Module Programmer's Guide.
>>>>> Current scope is limited to:
>>>>> - definition of the terminology which makes up the foundation of the compression API
>>>>> - the typical API flow expected to be used by applications
>>>>> - Stateless and Stateful operation definition and usage, following the RFC v1 doc review http://dev.dpdk.narkive.com/CHS5l01B/dpdk-dev-rfc-v1-doc-compression-api-for-dpdk
>>>>> 1. Overview
>>>>> ~~~~~~~~~~~
>>>>>
>>>>> A. Compression Methodologies in compression API
>>>>> ===========================================
>>>>> DPDK compression supports two types of compression methodologies:
>>>>> - Stateless - each data object is compressed individually without any reference to previous data,
>>>>> - Stateful - each data object is compressed with reference to previous data objects, i.e. the history of the data is needed for compression / decompression.
>>>>> For more explanation, please refer to RFC 1951 https://www.ietf.org/rfc/rfc1951.txt
>>>>> To support both methodologies, DPDK compression introduces two key concepts: Session and Stream.
>>>>> B. Notion of a session in compression API
>>>>> ==================================
>>>>> A Session in DPDK compression is a logical entity which is set up one time with immutable parameters, i.e. parameters that don't change across operations and devices.
>>>>> A session can be shared across multiple devices and multiple operations simultaneously.
>>>>> Typical session parameters include info such as:
>>>>> - compress / decompress
>>>>> - compression algorithm and associated configuration parameters
>>>>>
>>>>> An application can create different sessions on a device, initialized with the same or different xforms. Once a session is initialized with one xform, it cannot be re-initialized.
>>>>> C. Notion of stream in compression API
>>>>>  =======================================
>>>>> Unlike a session, which carries a common set of information across operations, a stream in DPDK compression is a logical entity which identifies a related set of operations and carries operation-specific information as needed by the device during its processing.
>>>>> It is a device-specific data structure which is opaque to the application, and is set up and maintained by the device.
>>>>> A stream can be used with *only* one op at a time, i.e. no two operations can share the same stream simultaneously.
>>>>> A stream is a *must* for stateful op processing and optional for stateless (please see the respective sections for more details).
>>>>> This enables sharing of a session by multiple threads handling different data sets, as each op carries its own context (internal states, history buffers et al.) in its attached stream.
>>>>> The application should call rte_comp_stream_create() and attach the stream to an op before operation processing begins, and free it via rte_comp_stream_free() after processing is complete.
>>>>> C. Notion of burst operations in compression API
>>>>>  =======================================
>>>>> A burst in DPDK compression is an array of operations where each op carries an independent set of data, i.e. a burst can look like:
>>>>>     ----------------------------------------------------------------------------------------------
>>>>>     enqueue_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
>>>>>     ----------------------------------------------------------------------------------------------
>>>>> Where op1 .. op5 are all independent of each other and carry entirely different sets of data.
>>>>> Each op can be attached to the same or a different session but *must* be attached to a different stream.
>>>>> Each op (struct rte_comp_op) carries the compression/decompression operational parameters and is both an input and an output parameter.
>>>>> The PMD gets source, destination and checksum information at input and updates it with bytes consumed, bytes produced and checksum at output.
>>>>> Since each operation in a burst is independent and thus can complete out of order, applications which need ordering should set up a per-op user data area with reordering information so that they can determine the enqueue order at dequeue.
>>>>> Also, if multiple threads call enqueue_burst() on the same queue pair, then it is the application's onus to use a proper locking mechanism to ensure exclusive enqueuing of operations.
>>>>> D. Stateless Vs Stateful
>>>>> ===================
>>>>> The compression API provides the RTE_COMP_FF_STATEFUL feature flag for a PMD to reflect its support for stateful operation. Each op carries an op type indicating whether it is to be processed stateful or stateless.
>>>>> D.1 Compression API Stateless operation
>>>>> ------------------------------------------------------
>>>>> An op is processed stateless if it has:
>>>>> - flush value set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on the compression side),
>>>>> - op_type set to RTE_COMP_OP_STATELESS,
>>>>> - all of the required input, and an output buffer large enough to store the output, i.e. OUT_OF_SPACE can never occur.
>>>>> When all of the above conditions are met, the PMD initiates stateless processing and releases acquired resources after processing of the current operation is complete, i.e. full input consumed and full output written.
>> [Fiona] I think the 3rd condition conflicts with D1.1 below and anyway cannot be a precondition, i.e. the PMD must initiate stateless processing based on RTE_COMP_OP_STATELESS.
>> It can't always know whether the output buffer is big enough before processing; it must process the input data, and only when it has consumed it all can it know whether all the output data fits in the output buffer.
>>
>> I'd suggest rewording as follows:
>> An op is processed statelessly if op_type is set to RTE_COMP_OP_STATELESS.
>> In this case
>> - The flush value must be set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on the compression side),
>> - All of the input data must be in the src buffer,
>> - The dst buffer should be large enough to hold the expected output.
>> The PMD acquires the necessary resources to process the op. After processing of the current operation is complete, whether successful or not, it releases the acquired resources, and no state, history or data is held in the PMD or carried over to subsequent ops.
>> In the SUCCESS case, the full input is consumed, the full output is written, and status is set to RTE_COMP_OP_STATUS_SUCCESS.
>> OUT_OF_SPACE is handled as in D1.1 below.
>>
> [Shally] Ok. Agreed.
>
>>>>> Application can optionally attach a stream to such ops. In such a case, the application must attach a different stream to each op.
>>>>> Application can enqueue a stateless burst by making consecutive enqueue_burst() calls, i.e. the following is relevant usage:
>>>>> enqueued = rte_compressdev_enqueue_burst(dev_id, qp_id, ops1, nb_ops);
>>>>> enqueued = rte_compressdev_enqueue_burst(dev_id, qp_id, ops2, nb_ops);
>>>>>
>>>>> *Note - Every call has a different ops array, i.e. the same rte_comp_op array *cannot be re-enqueued* to process the next batch of data until the previous ones are completely processed.
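One way to respect the "different ops array per call" rule is to alternate between two arrays and refill an array only once every op taken from it has been dequeued. The bookkeeping below is a minimal sketch of that idea; `struct demo_batch` and the `demo_*` helpers are hypothetical and track counts only, leaving the actual enqueue/dequeue calls to the application.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the rule that an ops array must not be
 * re-enqueued until all of its ops have been dequeued. */
struct demo_batch {
    size_t enqueued;   /* ops handed to the PMD from this array */
    size_t dequeued;   /* ops returned so far */
};

/* An array is reusable only when nothing from it is still in flight. */
static int demo_batch_reusable(const struct demo_batch *b)
{
    return b->dequeued == b->enqueued;
}

static void demo_batch_enqueue(struct demo_batch *b, size_t n)
{
    assert(demo_batch_reusable(b));   /* enforce the rule from the note above */
    b->enqueued = n;
    b->dequeued = 0;
}

static void demo_batch_complete(struct demo_batch *b, size_t n)
{
    assert(b->dequeued + n <= b->enqueued);
    b->dequeued += n;
}

/* Pick which of two arrays to fill next: the first free one, or -1. */
static int demo_pick_free(const struct demo_batch batches[2])
{
    for (int i = 0; i < 2; i++)
        if (demo_batch_reusable(&batches[i]))
            return i;
    return -1;
}
```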
>>>>> D.1.1 Stateless and OUT_OF_SPACE
>>>>> ------------------------------------------------
>>>>> OUT_OF_SPACE is a condition where the output buffer runs out of space while the PMD still has more data to produce. If a PMD runs into such a condition, it is an error condition in stateless processing.
>>>>> In such a case, the PMD resets itself and returns with status RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0, i.e. no input read, no output written.
>>>>> Application can resubmit the full input with a larger output buffer size.
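The stateless OUT_OF_SPACE contract (produced = consumed = 0, resubmit the full input with a larger buffer) can be illustrated with a self-contained toy. This is a sketch under stated assumptions: a run-length decoder stands in for DEFLATE, and `demo_stateless_decomp()` / `demo_decomp_with_retry()` are hypothetical, not RFC or DPDK functions. Doubling the output buffer on each retry is one possible policy, chosen here for simplicity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes mirroring the RFC names. */
enum demo_status { DEMO_SUCCESS, DEMO_OUT_OF_SPACE };

struct demo_result {
    enum demo_status status;
    size_t consumed;
    size_t produced;
};

/* Toy stateless "decompressor": input is (count, byte) pairs, standing in
 * for DEFLATE. On OUT_OF_SPACE it reports produced = consumed = 0; any
 * bytes already written to dst are simply not reported. */
static struct demo_result demo_stateless_decomp(const uint8_t *src, size_t src_len,
                                                uint8_t *dst, size_t dst_len)
{
    struct demo_result r = { DEMO_SUCCESS, 0, 0 };
    size_t out = 0;
    for (size_t i = 0; i + 1 < src_len; i += 2) {
        if (out + src[i] > dst_len) {
            r.status = DEMO_OUT_OF_SPACE;   /* no state kept, nothing reported */
            return r;
        }
        memset(dst + out, src[i + 1], src[i]);
        out += src[i];
    }
    r.consumed = src_len;
    r.produced = out;
    return r;
}

/* Application side: resubmit the whole input with a doubled output buffer
 * until it fits. Returns a malloc'd buffer (caller frees) or NULL. */
static uint8_t *demo_decomp_with_retry(const uint8_t *src, size_t src_len,
                                       size_t initial, size_t *out_len)
{
    assert(src != NULL || src_len == 0);
    size_t cap = initial ? initial : 1;
    for (;;) {
        uint8_t *dst = malloc(cap);
        if (dst == NULL)
            return NULL;
        struct demo_result r = demo_stateless_decomp(src, src_len, dst, cap);
        if (r.status == DEMO_SUCCESS) {
            *out_len = r.produced;
            return dst;
        }
        free(dst);   /* start over from the beginning of the input */
        cap *= 2;
    }
}
```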
>>>> [Ahmed] Can we add an option to allow the user to read the data that was produced while still reporting OUT_OF_SPACE? This is mainly useful for decompression applications doing search.
>>> [Shally] It is there but applicable for the stateful operation type (please refer to handling out_of_space under "Stateful Section").
>>> By definition, "stateless" here means that the application (such as IPCOMP) knows the maximum output size with certainty and ensures that the uncompressed data size cannot grow beyond the provided output buffer.
>>> Such apps can submit an op with type = STATELESS and provide full input; the PMD then assumes it has sufficient input and output and thus doesn't need to maintain any context after the op is processed.
>>> If the application doesn't know the max output size, then it should process it as a stateful op, i.e. set up the op with type = STATEFUL and attach a stream so that the PMD can maintain relevant context to handle such a condition.
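Shally's rule of thumb, combined with the RTE_COMP_FF_STATEFUL feature flag from section D, amounts to a small decision. The helper below is a hypothetical sketch: the flag is mirrored by a local `DEMO_FF_STATEFUL` bit whose value is assumed, and the enum names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins: the RFC names RTE_COMP_FF_STATEFUL and the op
 * types, but these values and this helper are illustrative only. */
#define DEMO_FF_STATEFUL (UINT64_C(1) << 0)

enum demo_op_type { DEMO_OP_STATELESS, DEMO_OP_STATEFUL, DEMO_OP_UNSUPPORTED };

/* - If the application can bound the output size (e.g. IPComp), submit
 *   the job stateless.
 * - Otherwise it needs a stateful op with a stream, which requires the
 *   PMD to advertise the stateful feature flag. */
static enum demo_op_type demo_choose_op_type(uint64_t feature_flags, bool max_out_known)
{
    if (max_out_known)
        return DEMO_OP_STATELESS;
    if (feature_flags & DEMO_FF_STATEFUL)
        return DEMO_OP_STATEFUL;
    return DEMO_OP_UNSUPPORTED;   /* find an output bound or another device */
}
```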
>> [Fiona] There may be an alternative that's useful for Ahmed, while still respecting the stateless concept.
>> In the stateless case where a PMD reports OUT_OF_SPACE in decompression, it could also return consumed=0, produced=x, where x>0. x indicates the amount of valid data which has been written to the output buffer. It is not complete, but if an application wants to search it, it may be sufficient.
>> If the application still wants the data, it must resubmit the whole input with a bigger output buffer, and decompression will be repeated from the start; it cannot expect to continue on, as the PMD has not maintained state, history or data.
>> I don't think there would be any need to indicate this in capabilities: PMDs which cannot provide this functionality would always return produced=consumed=0, while PMDs which can could set produced > 0.
>> If this works for you both, we could consider a similar case for compression.
>>
> [Shally] Sounds fine to me. Though in that case, consumed should also be updated to the actual amount consumed by the PMD.
> Setting consumed = 0 with produced > 0 doesn't correlate.
[Ahmed] I like Fiona's suggestion, but I also do not like the implication
of returning consumed = 0. At the same time, returning consumed = y
implies to the user that it can proceed from the middle. I prefer the
consumed = 0 implementation, but I think a different return is needed to
distinguish it from an OUT_OF_SPACE that the user can recover from.
Perhaps OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also
allows future PMD implementations to provide recoverability even in
STATELESS mode if they so wish. In this model STATELESS or STATEFUL would
be a hint for the PMD implementation to make optimizations for each case,
but it does not force the PMD implementation to limit functionality if
it can provide recoverability.
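To make the two proposals concrete, the sketch below shows how an application might react if a PMD distinguished the two OUT_OF_SPACE flavours Ahmed describes and also reported partially valid output as Fiona suggests. The status names `DEMO_OUT_OF_SPACE_RECOVERABLE` / `DEMO_OUT_OF_SPACE_TERMINATED` are hypothetical renderings of names proposed in this thread and exist in no agreed API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status set combining the proposals in this thread:
 * TERMINATED:  PMD kept no state; produced bytes (possibly 0) are valid,
 *              but the job must be resubmitted from the start.
 * RECOVERABLE: PMD kept state; the job can continue from src + consumed. */
enum demo_status {
    DEMO_SUCCESS,
    DEMO_OUT_OF_SPACE_TERMINATED,
    DEMO_OUT_OF_SPACE_RECOVERABLE,
};

enum demo_action { DEMO_DONE, DEMO_RESUBMIT_FROM_START, DEMO_CONTINUE };

struct demo_result {
    enum demo_status status;
    size_t consumed;
    size_t produced;
};

/* Decide the next step; *searchable receives how many output bytes may be
 * scanned now (e.g. by a search application) even if the job failed. */
static enum demo_action demo_next_action(const struct demo_result *r, size_t *searchable)
{
    *searchable = r->produced;
    switch (r->status) {
    case DEMO_SUCCESS:
        return DEMO_DONE;
    case DEMO_OUT_OF_SPACE_RECOVERABLE:
        return DEMO_CONTINUE;             /* resume at src + r->consumed */
    case DEMO_OUT_OF_SPACE_TERMINATED:
    default:
        assert(r->consumed == 0);         /* Ahmed's preferred consumed = 0 */
        return DEMO_RESUBMIT_FROM_START;  /* bigger dst, whole input again */
    }
}
```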
>
>>>>> D.2 Compression API Stateful operation
>>>>> ----------------------------------------------------------
>>>>> A stateful operation in DPDK compression means the application invokes enqueue_burst() multiple times to process related chunks of data, either because
>>>>> - the application broke the data into several ops, and/or
>>>>> - the PMD ran into an out_of_space situation during input processing.
>>>>>
>>>>> In case of either or both of the above conditions, the PMD is required to maintain the state of the op across enqueue_burst() calls, and ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with flush value RTE_COMP_NO/SYNC_FLUSH and ending with flush value RTE_COMP_FULL/FINAL_FLUSH.
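The flush rule above (intermediate chunks use NO/SYNC flush, the last one FULL/FINAL) can be sketched as a helper that splits one source buffer into stateful ops. The `demo_flush` values and `demo_split_stateful()` are hypothetical; this sketch picks NO_FLUSH for intermediate chunks and FINAL_FLUSH for the last one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flush values named after the RFC text. */
enum demo_flush { DEMO_NO_FLUSH, DEMO_SYNC_FLUSH, DEMO_FULL_FLUSH, DEMO_FINAL_FLUSH };

struct demo_chunk_op {
    size_t offset;           /* where this chunk starts in the source */
    size_t len;              /* chunk length */
    enum demo_flush flush;
};

/* Split src_len bytes into stateful ops of at most 'chunk' bytes: every op
 * but the last uses NO_FLUSH, the last uses FINAL_FLUSH.
 * Returns the number of ops written, or 0 if they do not fit in max_ops. */
static size_t demo_split_stateful(size_t src_len, size_t chunk,
                                  struct demo_chunk_op *ops, size_t max_ops)
{
    assert(chunk > 0);
    size_t n = (src_len + chunk - 1) / chunk;
    if (n == 0)
        n = 1;               /* an empty stream still needs a FINAL op */
    if (n > max_ops)
        return 0;
    for (size_t i = 0; i < n; i++) {
        ops[i].offset = i * chunk;
        ops[i].len = (i + 1 < n) ? chunk : src_len - i * chunk;
        ops[i].flush = (i + 1 < n) ? DEMO_NO_FLUSH : DEMO_FINAL_FLUSH;
    }
    return n;
}
```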
>>>>> D.2.1 Stateful operation state maintenance
>>>>> ---------------------------------------------------------------
>>>>> It is always the ideal expectation that the application parses through all related chunks of source data, making its mbuf chain, and enqueues it for stateless processing.
>>>>> However, if it needs to break it into several enqueue_burst() calls, then an expected call flow would be something like:
>>>>> enqueue_burst( |op.no_flush |)
>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue burst in a loop until all ops are received. Is this correct?
>>>>
>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is specifically in the context of stateful op processing, to reflect that if a stream is broken into chunks, then each chunk should be submitted as one op at a time with type = STATEFUL and needs to be dequeued first before the next chunk is enqueued.
>>>
>>>>> enqueue_burst( |op.no_flush |)
>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>> enqueue_burst( |op.full_flush |)
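The enqueue-one-chunk, dequeue-it, enqueue-the-next flow can be modelled with a mock single-slot queue pair. In this hypothetical sketch the "stream" carries a running Adler-32 checksum as its state, so the final value only comes out right if every chunk is processed in order against the same stream; `demo_enqueue()`, `demo_dequeue()` and `demo_run_stream()` are illustrative stand-ins, not DPDK calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock: the stream carries state across ops (a running
 * Adler-32 checksum and byte count), which is why the next chunk may only
 * be enqueued after the previous one has been dequeued. */
struct demo_stream {
    uint32_t a, b;        /* running checksum state */
    size_t total_in;
    int busy;             /* an op from this stream is in flight */
};

struct demo_op {
    struct demo_stream *stream;
    const uint8_t *src;
    size_t len;
    int final;            /* 1 for the op carrying the FINAL flush */
};

static struct demo_op *demo_inflight;   /* single-slot mock queue pair */

static int demo_enqueue(struct demo_op *op)
{
    if (op->stream->busy || demo_inflight)
        return 0;                        /* rejected: stream busy or mock queue full */
    op->stream->busy = 1;
    demo_inflight = op;
    return 1;
}

/* "Process" the in-flight op, updating its stream's state, and return it. */
static struct demo_op *demo_dequeue(void)
{
    struct demo_op *op = demo_inflight;
    if (op == NULL)
        return NULL;
    for (size_t i = 0; i < op->len; i++) {
        op->stream->a = (op->stream->a + op->src[i]) % 65521;
        op->stream->b = (op->stream->b + op->stream->a) % 65521;
    }
    op->stream->total_in += op->len;
    op->stream->busy = 0;
    demo_inflight = NULL;
    return op;
}

/* Application flow from D.2.1: enqueue one chunk, dequeue it, repeat. */
static uint32_t demo_run_stream(const uint8_t *data, size_t len, size_t chunk)
{
    struct demo_stream s = { 1, 0, 0, 0 };
    for (size_t off = 0; off < len; off += chunk) {
        struct demo_op op = { &s, data + off,
                              (len - off < chunk) ? len - off : chunk,
                              off + chunk >= len };
        int ok = demo_enqueue(&op);
        assert(ok);
        while (demo_dequeue() == NULL)
            ;                            /* poll until this chunk completes */
    }
    return (s.b << 16) | s.a;
}
```

Whatever the chunk size, the result matches a single-shot Adler-32 of the same data (for "Wikipedia" that is 0x11E60398), which is the property a stateful stream has to preserve across enqueue_burst() calls.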
>>>> [Ahmed] Why not allow multiple work items in flight? I understand that occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish the response in exception cases?
>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in such a case is independent of the others, i.e. they belong to different streams altogether.
>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks of data in a single burst by passing them as an ops array, but later found that not so useful for PMD handling, for various reasons. Please refer to the RFC v1 doc review comments for the same.
>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a time, since each needs the state of the previous, allowing more than 1 op to be in-flight at a time would force PMDs to implement internal queueing and exception handling for the OUT_OF_SPACE conditions you mention.
[Ahmed] But we are putting the ops on qps, which would make them
sequential. Handling OUT_OF_SPACE conditions would be a little bit more
complex but doable. The question is whether this mode of use is useful
for real-life applications or whether we would just be adding complexity.
The technical advantage is that processing of stateful ops is
interdependent, and PMDs can take advantage of caching and other
optimizations to make processing related ops much faster than switching
on every op. PMDs have to maintain more than 32 KB of state for DEFLATE
for every stream.
>> If the application has all the data, it can put it into chained mbufs in a single op rather than multiple ops, which avoids pushing all that complexity down to the PMDs.
[Ahmed] I think that your suggested scheme of putting all related mbufs
into one op may be the best solution without the extra complexity of
handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
time, if we have a way of marking mbufs as ready for consumption. The
enqueuer may not have all the data at hand but can enqueue the op with a
couple of empty mbufs marked as not ready for consumption. The enqueuer
will then update the rest of the mbufs to ready for consumption once the
data is added. This introduces a race condition. A second flag for each
mbuf can be updated by the PMD to indicate whether it has processed it.
This way, in cases where the PMD beats the application to the op, the
application will just update the op to point to the first unprocessed
mbuf and resend it to the PMD.
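Ahmed's ready/processed scheme can be sketched with two flags per segment. This is a hypothetical model only: DPDK mbufs have no such fields, `struct demo_seg` and the `demo_*` functions are invented for illustration, and C11 atomics with acquire/release ordering stand in for whatever synchronisation a real PMD and application would share.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-segment flags modelling the proposed protocol. */
struct demo_seg {
    atomic_int ready;       /* set by the application when data is in place */
    atomic_int processed;   /* set by the PMD once it has consumed the segment */
    size_t len;
};

/* PMD side: consume ready segments in order starting at 'first'; stop at
 * the first segment that is not ready yet. Returns segments consumed. */
static size_t demo_pmd_consume(struct demo_seg *segs, size_t n, size_t first)
{
    assert(first <= n);
    size_t i = first;
    while (i < n && atomic_load_explicit(&segs[i].ready, memory_order_acquire)) {
        /* ... process segs[i].len bytes ... */
        atomic_store_explicit(&segs[i].processed, 1, memory_order_release);
        i++;
    }
    return i - first;
}

/* Application side: publish a filled segment (len is visible to the PMD
 * once it observes ready == 1, thanks to release/acquire ordering). */
static void demo_app_mark_ready(struct demo_seg *s, size_t len)
{
    s->len = len;
    atomic_store_explicit(&s->ready, 1, memory_order_release);
}

/* Application side: after the op completes, find where to resend from.
 * Returns n if everything was processed. */
static size_t demo_first_unprocessed(struct demo_seg *segs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!atomic_load_explicit(&segs[i].processed, memory_order_acquire))
            return i;
    return n;
}
```

The application would call `demo_first_unprocessed()` only after the op has been dequeued, so the PMD is no longer touching the flags; the op is then resent starting at the returned segment.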
>>
>>>>> Here an op *must* be attached to a stream and every subsequent enqueue_burst() call should carry the *same* stream. Since the PMD maintains op state in the stream, it is mandatory for the application to attach a stream to such ops.
>> [Fiona] I think you're referring only to a single stream above, but as there may be many different streams, maybe add the following?
>> Above is simplified to show just a single stream. However, there may be many streams, and each enqueue_burst() may contain ops from different streams, as long as there is only one op in-flight from any stream at a given time.
>>
> [Shally] Ok, got it.
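Fiona's generalisation (many streams per burst, but only one op in flight per stream) can be enforced on the application side by a small scheduler. The sketch below is hypothetical: `struct demo_stream` carries an `inflight` marker that the application sets when an op is put into a burst and clears on dequeue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stream/op stand-ins. */
struct demo_stream { int inflight; };
struct demo_op { struct demo_stream *stream; };

/* Build a burst from pending[] honouring "only one op in flight from any
 * stream": an op is taken only if its stream is idle. Marking the stream
 * as soon as an op is taken also prevents two ops of one stream landing in
 * the same burst. Taken ops are removed from pending[]; the rest keep
 * their order. Returns the burst size. */
static size_t demo_build_burst(struct demo_op **pending, size_t *n_pending,
                               struct demo_op **burst, size_t max_burst)
{
    size_t nb = 0, keep = 0;
    for (size_t i = 0; i < *n_pending; i++) {
        struct demo_op *op = pending[i];
        if (nb < max_burst && !op->stream->inflight) {
            op->stream->inflight = 1;    /* cleared again at dequeue */
            burst[nb++] = op;
        } else {
            pending[keep++] = op;
        }
    }
    *n_pending = keep;
    return nb;
}

/* Called for every dequeued op. */
static void demo_on_dequeue(struct demo_op *op)
{
    assert(op->stream->inflight);
    op->stream->inflight = 0;
}
```

If enqueue_burst() accepts fewer ops than offered, the application would clear `inflight` on the rejected ops and return them to `pending[]`.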
>
>>>>> D.2.2 Stateful and Out_of_Space
>>>>> --------------------------------------------
>>>>> If the PMD supports stateful and runs into an OUT_OF_SPACE situation, then it is not an error condition for the PMD. In such a case, the PMD returns with status RTE_COMP_OP_STATUS_OUT_OF_SPACE with consumed = number of input bytes read and produced = length of complete output buffer.
>> [Fiona] - produced would be <= output buffer len (typically =, but could be a few bytes less)
>>
>>
>>>>> Application should enqueue the op with the source starting at byte consumed+1 (i.e. at offset consumed in the source buffer) and an output buffer with available space.
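The resume rule for stateful OUT_OF_SPACE (continue from the first unread input byte, i.e. offset `consumed`, into fresh output space) can be sketched with a toy PMD that expands each input byte into two output bytes. Because it consumes a byte only when both output bytes fit, `produced` may be one byte short of the buffer, matching Fiona's note. Everything here is hypothetical, and the toy keeps no cross-op history, so it demonstrates only the consumed/produced arithmetic, not real stream state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_status { DEMO_SUCCESS, DEMO_OUT_OF_SPACE };

struct demo_result { enum demo_status status; size_t consumed, produced; };

/* Toy stateful PMD: every input byte expands to two output bytes, and an
 * input byte is consumed only when both output bytes fit. */
static struct demo_result demo_stateful_op(const uint8_t *src, size_t src_len,
                                           uint8_t *dst, size_t dst_len)
{
    struct demo_result r = { DEMO_SUCCESS, 0, 0 };
    while (r.consumed < src_len) {
        if (r.produced + 2 > dst_len) {
            r.status = DEMO_OUT_OF_SPACE;   /* not an error for stateful */
            break;
        }
        dst[r.produced++] = src[r.consumed];
        dst[r.produced++] = src[r.consumed];
        r.consumed++;
    }
    return r;
}

/* Application: on OUT_OF_SPACE, resubmit from src + consumed into the next
 * free output space, offering at most out_chunk bytes per op. Returns the
 * total produced, or 0 if an op made no progress. */
static size_t demo_run_with_resume(const uint8_t *src, size_t src_len,
                                   uint8_t *out, size_t out_cap, size_t out_chunk)
{
    size_t in_off = 0, out_off = 0;
    for (;;) {
        size_t room = out_cap - out_off;
        if (room > out_chunk)
            room = out_chunk;
        struct demo_result r = demo_stateful_op(src + in_off, src_len - in_off,
                                                out + out_off, room);
        in_off += r.consumed;
        out_off += r.produced;
        assert(out_off <= out_cap);
        if (r.status == DEMO_SUCCESS)
            return out_off;
        if (r.consumed == 0 && r.produced == 0)
            return 0;   /* output space too small to make progress */
    }
}
```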
>>>>
>>>> [Ahmed] Related to OUT_OF_SPACE. What status does the user receive in a decompression case when the end block is encountered before the end of the input? Does the PMD continue decomp? Does it stop there and return the stop index?
>>>>
>>> [Shally] Before I can answer this, please help me understand your use case. When you say "when the end block is encountered before the end of the input?", do you mean -
>>> "Decompressor processes a final block (i.e. has BFINAL=1 in its header) and there's some footer data after that?" Or
>>> do you mean "decompressor processes one block and has more to process until its final block?"
>>> What do "end block" and "end of input" refer to here?
[Ahmed] I meant BFINAL=1 by end block. The end of input is the end of
the input length.
e.g.
| input length------------------------------------------------------|
|--data----data----data------data-------BFINAL-footer-unrelated data|
>>>
>>>>> D.2.3 Sliding Window Size
>>>>> ------------------------------------
>>>>> Every PMD will reflect in its algorithm capability structure the maximum length of the sliding window in bytes, which indicates the maximum history buffer length used by the algorithm.
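As an illustration of how an application might use the advertised maximum window length, the sketch below picks the largest DEFLATE window it can request on the compression side. It assumes zlib/RFC 1950-style window sizes of 2^w bytes with 8 <= w <= 15 (256 bytes to 32 KiB); `struct demo_algo_caps` and its `max_window_bytes` field are hypothetical stand-ins for the capability structure the RFC mentions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability record; the RFC only says each PMD reports the
 * maximum sliding-window (history) length in bytes per algorithm. */
struct demo_algo_caps {
    uint32_t max_window_bytes;
};

/* Return the largest window-bits value w <= requested (window = 2^w bytes)
 * that fits the device's advertised history length, or -1 if none does. */
static int demo_pick_window_bits(const struct demo_algo_caps *caps, int requested)
{
    assert(requested >= 8 && requested <= 15);
    for (int w = requested; w >= 8; w--)
        if ((UINT32_C(1) << w) <= caps->max_window_bytes)
            return w;
    return -1;
}
```

This only applies to compression: a decompressor must support the window the compressor actually used, so on the decompression side a too-small device window means the job cannot be offloaded, rather than that a smaller window can be chosen.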
>>>>> 2. Example API illustration
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>> [Fiona] I think it would be useful to show an example of both a STATELESS flow and a STATEFUL flow.
>>
> [Shally] Ok. I can add a simplified version to illustrate API usage in both cases.
>
>>>>> Following is an illustration of API usage (this is just one flow; other variants are also possible):
>>>>> 1. rte_comp_session *sess = rte_compressdev_session_create(rte_mempool *pool);
>>>>> 2. rte_compressdev_session_init(int dev_id, rte_comp_session *sess, rte_comp_xform *xform, rte_mempool *sess_pool);
>>>>> 3. rte_comp_op_pool_create(rte_mempool ..)
>>>>> 4. rte_comp_op_bulk_alloc(struct rte_mempool *mempool, struct rte_comp_op **ops, uint16_t nb_ops);
>>>>> 5. for every rte_comp_op in ops[],
>>>>>     5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session *sess);
>>>>>     5.2 op.op_type = RTE_COMP_OP_STATELESS
>>>>>     5.3 op.flush = RTE_FLUSH_FINAL
>>>>> 6. [Optional] for every rte_comp_op in ops[],
>>>>>     6.1 rte_comp_stream_create(int dev_id, rte_comp_session *sess, void **stream);
>>>>>     6.2 rte_comp_op_attach_stream(rte_comp_op *op, void *stream);
>>>>
>>>> [Ahmed] What is the semantic effect of attaching a stream to every op? Will this application benefit from it, given that it is set up with op_type STATELESS?
>>> [Shally] By role, a stream is a data structure that holds all the information a PMD needs to maintain for op processing, and thus it's marked device specific. It is required for stateful processing but optional for stateless, as the PMD doesn't need to maintain context once the op is processed, unlike stateful.
>>> Using a stream for stateless ops may be of advantage to some PMDs. They can be designed to do one-time per-op setup (such as mapping session params) during stream_create() in the control path rather than in the data path.
>>>
>> [Fiona] Yes, we agreed that stream_create() should be called for every session and, if it returns non-NULL, the PMD needs it, so op_attach_stream() must be called.
>> However, I've just realised we don't have a STATEFUL/STATELESS param on the xform, just on the op.
>> So we could either add a stateful/stateless param to stream_create()
>> OR add a stateful/stateless param to the xform so it would be in the session?
> [Shally] No, it shouldn't be part of the session or xform, as sessions aren't stateless/stateful.
> So we shouldn't alter the current definition of session or xforms.
> If we need to mention it, then it could be added as part of stream_create(), as it's device specific, and depending upon op_type the device can then set up stream resources.
>
>> However, Shally, can you reconsider whether you really need it for STATELESS, or whether the data you want to store there could be stored in the session? Or, if it's needed per-op, does it really need to be visible on the API as a stream, or could it be hidden within the PMD?
> [Shally] I would say it is not mandatory, but a desirable feature that I am suggesting.
> I am only trying to enable an optimization in the data path which may help some PMD designs, as they can use stream_create() to do setup that is one-time per op regardless of op_type, such as, as I mentioned, mapping user session params to device session params.
> We can hide it inside the PMD; however, there may be slight overhead in the data path depending on the PMD design.
> But I would say it's not a blocker for us to freeze the current spec. We can revisit this feature later because it will not alter base API functionality.
>
> Thanks
> Shally
>
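Fiona's convention (always call stream_create(); if it returns non-NULL, attach the stream) together with Shally's suggestion of passing the op type to stream_create() can be sketched with a function-pointer model of a PMD. All names are hypothetical, and passing the op type is one of the options discussed above, not an agreed signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: a PMD that wants per-op context (even for STATELESS
 * ops) returns a stream from stream_create(), otherwise NULL. */
enum demo_op_type { DEMO_OP_STATELESS, DEMO_OP_STATEFUL };

struct demo_pmd {
    void *(*stream_create)(enum demo_op_type type);
    void (*stream_free)(void *stream);
};

struct demo_op {
    enum demo_op_type type;
    void *stream;
};

/* Returns 0 on success, -1 if a stateful op could not get its stream. */
static int demo_prepare_op(const struct demo_pmd *pmd, struct demo_op *op,
                           enum demo_op_type type)
{
    assert(pmd != NULL && op != NULL);
    op->type = type;
    op->stream = pmd->stream_create(type);   /* op_type passed, per Shally */
    if (op->stream == NULL && type == DEMO_OP_STATEFUL)
        return -1;                           /* stream is mandatory here */
    return 0;                                /* NULL is fine for stateless */
}

static void demo_release_op(const struct demo_pmd *pmd, struct demo_op *op)
{
    if (op->stream != NULL)
        pmd->stream_free(op->stream);
    op->stream = NULL;
}

/* Two toy PMDs: one keeps context only for stateful ops, one always does. */
static void *demo_lazy_create(enum demo_op_type t)
{
    return t == DEMO_OP_STATEFUL ? malloc(64) : NULL;
}

static void *demo_eager_create(enum demo_op_type t)
{
    (void)t;
    return malloc(64);   /* e.g. pre-maps session params to device params */
}

static void demo_free(void *s) { free(s); }
```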
>>>>> 7. for every rte_comp_op in ops[],
>>>>>      7.1 set up with src/dst buffer
>>>>> 8. enq = rte_compressdev_enqueue_burst(dev_id, qp_id, ops, nb_ops);
>>>>> 9. dqu = 0; do while (dqu < enq) // wait till all enqueued ops are dequeued
>>>>>     9.1 dqu += rte_compressdev_dequeue_burst(dev_id, qp_id, &ops[dqu], enq - dqu);
>>>> [Ahmed] I am assuming that waiting for all enqueued to be dequeued is not strictly necessary, but is just the chosen example in this case.
>>>>
>>> [Shally] Yes. By design, for burst_size>1 each op is independent of the others, so the app may proceed as soon as it dequeues any.
>>>
>>>>> 10. Repeat 7 for next batch of data
>>>>> 11. for every op in ops[]
>>>>>       11.1 rte_comp_stream_free(op->stream);
>>>>> 12. rte_comp_session_clear(sess);
>>>>> 13. rte_comp_session_terminate(rte_comp_session *session)
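Steps 8 and 9 rely on accumulating the dequeued count, since a single dequeue call may return only some of the enqueued ops. The mock below is a hypothetical sketch: `demo_enqueue_burst()` / `demo_dequeue_burst()` only mimic the shape of the RFC's burst calls and return at most `per_call` ops per dequeue.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mock queue pair: dequeue returns at most per_call ops. */
struct demo_qp {
    uint16_t inflight;
    uint16_t per_call;
};

static uint16_t demo_enqueue_burst(struct demo_qp *qp, void **ops, uint16_t nb_ops)
{
    (void)ops;
    qp->inflight += nb_ops;               /* mock accepts everything */
    return nb_ops;
}

static uint16_t demo_dequeue_burst(struct demo_qp *qp, void **ops, uint16_t nb_ops)
{
    (void)ops;
    uint16_t n = qp->inflight < qp->per_call ? qp->inflight : qp->per_call;
    if (n > nb_ops)
        n = nb_ops;
    qp->inflight -= n;
    return n;
}

/* Steps 8 and 9: enqueue a batch and wait until every accepted op is back.
 * Returns the number of dequeue calls made. */
static unsigned demo_submit_and_wait(struct demo_qp *qp, void **ops, uint16_t nb_ops)
{
    uint16_t enq = demo_enqueue_burst(qp, ops, nb_ops);
    uint16_t dqu = 0;
    unsigned calls = 0;
    while (dqu < enq) {
        /* dequeue into the unused tail of ops[] and accumulate the count */
        dqu += demo_dequeue_burst(qp, &ops[dqu], (uint16_t)(enq - dqu));
        calls++;
    }
    assert(qp->inflight == 0);
    return calls;
}
```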
>>>>>
>>>>> Thanks
>>>>> Shally
>>>>>
>>>>>
>



* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-01-25 18:19         ` Ahmed Mansour
@ 2018-01-29 12:47           ` Verma, Shally
  2018-01-31 19:03           ` Trahe, Fiona
  1 sibling, 0 replies; 30+ messages in thread
From: Verma, Shally @ 2018-01-29 12:47 UTC (permalink / raw)
  To: Ahmed Mansour, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

Hi Ahmed

> -----Original Message-----
> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> Sent: 25 January 2018 23:49
> To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona
> <fiona.trahe@intel.com>; dev@dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
> Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>;
> Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> Subject: Re: [RFC v2] doc compression API for DPDK
> 
> Hi All,
> 
> Sorry for the delay. Please see responses inline.
> 
> Ahmed
> 
> On 1/12/2018 8:50 AM, Verma, Shally wrote:
> > Hi Fiona
> >
> >> -----Original Message-----
> >> From: Trahe, Fiona [mailto:fiona.trahe@intel.com]
> >> Sent: 12 January 2018 00:24
> >> To: Verma, Shally <Shally.Verma@cavium.com>; Ahmed Mansour
> >> <ahmed.mansour@nxp.com>; dev@dpdk.org
> >> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
> >> Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
> >> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> >> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
> >> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>;
> >> Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> >> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>; Trahe,
> >> Fiona <fiona.trahe@intel.com>
> >> Subject: RE: [RFC v2] doc compression API for DPDK
> >>
> >> Hi Shally, Ahmed,
> >>
> >>
> >>> -----Original Message-----
> >>> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
> >>> Sent: Wednesday, January 10, 2018 12:55 PM
> >>> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>; dev@dpdk.org
> >>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> >>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> >>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
> >>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> >>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> >>> Subject: RE: [RFC v2] doc compression API for DPDK
> >>>
> >>> HI Ahmed
> >>>
> >>>> -----Original Message-----
> >>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> >>>> Sent: 10 January 2018 00:38
> >>>> To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona
> >>>> <fiona.trahe@intel.com>; dev@dpdk.org
> >>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>;
> >>>> Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
> >>>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> >>>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
> >>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>;
> >>>> Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> >>>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> >>>> Subject: Re: [RFC v2] doc compression API for DPDK
> >>>>
> >>>> Hi Shally,
> >>>>
> >>>> Thanks for the summary. It is very helpful. Please see comments below
> >>>>
> >>>>
> >>>> On 1/4/2018 6:45 AM, Verma, Shally wrote:
> >>>>> This is an RFC v2 document to brief understanding and requirements on the compression API proposal in DPDK. It is based on "[RFC v3] Compression API in DPDK http://dpdk.org/dev/patchwork/patch/32331/ ".
> >>>>> The intention of this document is to align on the concepts built into the compression API and its usage, and to identify further requirements.
> >>>>> Going further, it could be a base for the Compression Module Programmer Guide.
> >>>>> Current scope is limited to
> >>>>> - definition of the terminology which makes up the foundation of the compression API
> >>>>> - typical API flow expected to be used by applications
> >>>>> - Stateless and Stateful operation definition and usage after RFC v1 doc review http://dev.dpdk.narkive.com/CHS5l01B/dpdk-dev-rfc-v1-doc-compression-api-for-dpdk
> >>>>> 1. Overview
> >>>>> ~~~~~~~~~~~
> >>>>>
> >>>>> A. Compression Methodologies in compression API
> >>>>> ===========================================
> >>>>> DPDK compression supports two types of compression methodologies:
> >>>>> - Stateless - each data object is compressed individually without any reference to previous data,
> >>>>> - Stateful - each data object is compressed with reference to the previous data object, i.e. history of data is needed for compression / decompression
> >>>>> For more explanation, please refer to RFC https://www.ietf.org/rfc/rfc1951.txt
> >>>>> To support both methodologies, DPDK compression introduces two key concepts: Session and Stream.
> >>>>> B. Notion of a session in compression API
> >>>>> ==================================
> >>>>> A Session in DPDK compression is a logical entity which is set up one time with immutable parameters, i.e. parameters that don't change across operations and devices.
> >>>>> A session can be shared across multiple devices and multiple operations simultaneously.
> >>>>> Typical session parameters include info such as:
> >>>>> - compress / decompress
> >>>>> - compression algorithm and associated configuration parameters
> >>>>>
> >>>>> Application can create different sessions on a device initialized with same/different xforms. Once a session is initialized with one xform, it cannot be re-initialized.
> >>>>> C. Notion of stream in compression API
> >>>>>  =======================================
> >>>>> Unlike a session, which carries a common set of information across operations, a stream in DPDK compression is a logical entity which identifies a related set of operations and carries operation-specific information as needed by the device during its processing.
> >>>>> It is a device-specific data structure which is opaque to the application, set up and maintained by the device.
> >>>>> A stream can be used with *only* one op at a time, i.e. no two operations can share the same stream simultaneously.
> >>>>> A stream is a *must* for stateful ops processing and optional for stateless (please see the respective sections for more details).
> >>>>> This enables sharing of a session by multiple threads handling different data sets, as each op carries its own context (internal states, history buffers et al.) in its attached stream.
> >>>>> Application should call rte_comp_stream_create() and attach the stream to the op before the beginning of operation processing, and free it via rte_comp_stream_free() after processing is complete.
> >>>>> C. Notion of burst operations in compression API
> >>>>>  =======================================
> >>>>> A burst in DPDK compression is an array of operations where each op carries an independent set of data, i.e. a burst can look like:
> >>>>>     ----------------------------------------------------------------------------------------------
> >>>>>     enqueue_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
> >>>>>     ----------------------------------------------------------------------------------------------
> >>>>> Where op1 .. op5 are all independent of each other and carry entirely different sets of data.
> >>>>> Each op can be attached to the same/different session but *must* be attached to a different stream.
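The burst rule (ops may share a session but never a stream) lends itself to a debug-time check before enqueue. The sketch below is hypothetical; it treats a NULL stream as "no stream attached", which the RFC permits for stateless ops, and only flags two ops that point at the same stream.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op stand-in: only the attachments matter here. */
struct demo_op {
    const void *session;   /* may be shared by many ops */
    const void *stream;    /* NULL allowed for stateless ops */
};

/* Return 1 if no two ops in the burst share a (non-NULL) stream, else 0.
 * Sessions may repeat freely. O(n^2), fine for typical burst sizes. */
static int demo_burst_streams_distinct(const struct demo_op *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ops[i].stream == NULL)
            continue;
        for (size_t j = i + 1; j < n; j++)
            if (ops[j].stream == ops[i].stream)
                return 0;
    }
    return 1;
}
```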
> >>>>> Each op (struct rte_comp_op) carries compression/decompression operational parameters and is both an input/output parameter.
> >>>>> PMD gets source, destination and checksum information at input and updates it with bytes consumed, bytes produced and the checksum at output.
> >>>>> Since each operation in a burst is independent and can thus complete out-of-order, applications which need ordering should set up a per-op user data area with reordering information so that they can determine the enqueue order at dequeue.
> >>>>> Also, if multiple threads call enqueue_burst() on the same queue pair, then it is the application's onus to use a proper locking mechanism to ensure exclusive enqueuing of operations.
> >>>>> D. Stateless Vs Stateful
> >>>>> ===================
> >>>>> Compression API provides the RTE_COMP_FF_STATEFUL feature flag for a PMD to reflect its support for stateful operation. Each op carries an op type indicating whether it is to be processed stateful or stateless.
> >>>>> D.1 Compression API Stateless operation
> >>>>> ------------------------------------------------------
> >>>>> An op is processed stateless if it has
> >>>>> - flush value set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on compression side),
> >>>>> - op_type set to RTE_COMP_OP_STATELESS,
> >>>>> - all of the required input and a sufficiently large output buffer to store the output, i.e. OUT_OF_SPACE can never occur.
> >>>>> When all of the above conditions are met, the PMD initiates stateless processing and releases acquired resources after processing of the current operation is complete, i.e. full input consumed and full output written.
> >> [Fiona] I think the 3rd condition conflicts with D1.1 below and anyway cannot be a precondition, i.e.
> >> PMD must initiate stateless processing based on RTE_COMP_OP_STATELESS.
> >> It can't always know if the output buffer is big enough before processing; it must process the input data, and only when it has consumed it all can it know whether all the output data fits in the output buffer.
> >>
> >> I'd suggest rewording as follows:
> >> An op is processed statelessly if op_type is set to RTE_COMP_OP_STATELESS.
> >> In this case
> >> - The flush value must be set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on compression side),
> >> - All of the input data must be in the src buffer
> >> - The dst buffer should be large enough to hold the expected output
> >> The PMD acquires the necessary resources to process the op. After processing of the current operation is complete, whether successful or not, it releases acquired resources and no state, history or data is held in the PMD or carried over to subsequent ops.
> >> In the SUCCESS case, full input is consumed, full output is written and status is set to RTE_COMP_OP_STATUS_SUCCESS.
> >> OUT-OF-SPACE as D1.1 below.
> >>
> > [Shally] Ok. Agreed.
> >
> >>>>> Application can optionally attach a stream to such ops. In such a case, the application must attach a different stream to each op.
> >>>>> Application can enqueue a stateless burst by making consecutive enqueue_burst() calls, i.e. the following is relevant usage:
> >>>>> enqueued = rte_compressdev_enqueue_burst(dev_id, qp_id, ops1, nb_ops);
> >>>>> enqueued = rte_compressdev_enqueue_burst(dev_id, qp_id, ops2, nb_ops);
> >>>>>
> >>>>> *Note - Every call has a different ops array, i.e. the same rte_comp_op array *cannot be re-enqueued* to process the next batch of data until the previous ones are completely processed.
> >>>>> D.1.1 Stateless and OUT_OF_SPACE
> >>>>> ------------------------------------------------
> >>>>> OUT_OF_SPACE is a condition where the output buffer runs out of space while the PMD still has more data to produce. If a PMD runs into such a condition, it is an error condition in stateless processing.
> >>>>> In such a case, the PMD resets itself and returns with status RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0, i.e. no input read, no output written.
> >>>>> Application can resubmit the full input with a larger output buffer size.
> >>>> [Ahmed] Can we add an option to allow the user to read the data that was produced while still reporting OUT_OF_SPACE? This is mainly useful for decompression applications doing search.
> >>> [Shally] It is there but applicable for the stateful operation type (please refer to handling out_of_space under "Stateful Section").
> >>> By definition, "stateless" here means that the application (such as IPCOMP) knows the maximum output size with certainty and ensures that the uncompressed data size cannot grow beyond the provided output buffer.
> >>> Such apps can submit an op with type = STATELESS and provide full input; the PMD then assumes it has sufficient input and output and thus doesn't need to maintain any context after the op is processed.
> >>> If the application doesn't know the max output size, then it should process it as a stateful op, i.e. set up the op with type = STATEFUL and attach a stream so that the PMD can maintain relevant context to handle such a condition.
> >> [Fiona] There may be an alternative that's useful for Ahmed, while still respecting the stateless concept.
> >> In the stateless case where a PMD reports OUT_OF_SPACE in decompression, it could also return consumed=0, produced=x, where x>0. x indicates the amount of valid data which has been written to the output buffer. It is not complete, but if an application wants to search it, it may be sufficient.
> >> If the application still wants the data, it must resubmit the whole input with a bigger output buffer, and decompression will be repeated from the start; it cannot expect to continue on, as the PMD has not maintained state, history or data.
> >> I don't think there would be any need to indicate this in capabilities: PMDs which cannot provide this functionality would always return produced=consumed=0, while PMDs which can could set produced > 0.
> >> If this works for you both, we could consider a similar case for compression.
> >>
> > [Shally] Sounds fine to me. Though in that case, consumed should also be updated to the actual amount consumed by the PMD.
> > Setting consumed = 0 with produced > 0 doesn't correlate.
> [Ahmed]I like Fiona's suggestion, but I also do not like the implication
> of returning consumed = 0. At the same time returning consumed = y
> implies to the user that it can proceed from the middle. I prefer the
> consumed = 0 implementation, but I think a different return is needed to
> distinguish it from OUT_OF_SPACE that the use can recover from. Perhaps
> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also
> allows
> future PMD implementations to provide recover-ability even in STATELESS
> mode if they so wish. In this model STATELESS or STATEFUL would be a
> hint for the PMD implementation to make optimizations for each case, but
> it does not force the PMD implementation to limit functionality if it
> can provide recover-ability.
> >
> >>>>> D.2 Compression API Stateful operation
> >>>>> ----------------------------------------------------------
> >>>>>  A Stateful operation in DPDK compression means application invokes
> >>>>> enqueue_burst() multiple times to process related chunk of data either because
> >>>>> - Application broke data into several ops, and/or
> >>>>> - PMD ran into out_of_space situation during input processing
> >>>>>
> >>>>> In case of either one or all of the above conditions, PMD is required to
> >>>>> maintain state of op across enqueue_burst() calls and
> >>>>> ops are setup with op_type RTE_COMP_OP_STATEFUL, and begin with
> >>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and end at flush value
> >>>>> RTE_COMP_FULL/FINAL_FLUSH.
> >>>>> D.2.1 Stateful operation state maintenance
> >>>>> ---------------------------------------------------------------
> >>>>> It is always an ideal expectation from application that it should parse
> >>>>> through all related chunk of source data making its mbuf-chain and
> >>>>> enqueue it for stateless processing.
> >>>>> However, if it need to break it into several enqueue_burst() calls, then
> >>>>> an expected call flow would be something like:
> >>>>> enqueue_burst( |op.no_flush |)
> >>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
> >>>> burst in a loop until all ops are received. Is this correct?
> >>>>
> >>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>> [Shally] Yes. Ideally every submitted op need to be dequeued. However
> >>> this illustration is specifically in context of stateful op processing
> >>> to reflect if a stream is broken into chunks, then each chunk should be
> >>> submitted as one op at-a-time with type = STATEFUL and need to be
> >>> dequeued first before next chunk is enqueued.
> >>>
> >>>>> enqueue_burst( |op.no_flush |)
> >>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>> enqueue_burst( |op.full_flush |)
> >>>> [Ahmed] Why not allow multiple work items in flight? I understand that
> >>>> occasionally there will be OUT_OF_SPACE exception. Can we just
> >>>> distinguish the response in exception cases?
> >>> [Shally] Multiple ops are allowed in flight, however condition is each op
> >>> in such case is independent of each other i.e. belong to different
> >>> streams altogether.
> >>> Earlier (as part of RFC v1 doc) we did consider the proposal to process
> >>> all related chunks of data in single burst by passing them as ops array
> >>> but later found that as not-so-useful for PMD handling for various
> >>> reasons. You may please refer to RFC v1 doc review comments for same.
> >> [Fiona] Agree with Shally. In summary, as only one op can be processed at
> >> a time, since each needs the state of the previous, to allow more than 1
> >> op to be in-flight at a time would force PMDs to implement internal
> >> queueing and exception handling for OUT_OF_SPACE conditions you mention.
> [Ahmed] But we are putting the ops on qps which would make them
> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
> complex but doable. The question is this mode of use useful for real
> life applications or would we be just adding complexity? The technical
> advantage of this is that processing of Stateful ops is interdependent
> and PMDs can take advantage of caching and other optimizations to make
> processing related ops much faster than switching on every op. PMDs have
> to maintain state of more than 32 KB for DEFLATE for every stream.
> >> If the application has all the data, it can put it into chained mbufs in a
> >> single op rather than multiple ops, which avoids pushing all that
> >> complexity down to the PMDs.
> [Ahmed] I think that your suggested scheme of putting all related mbufs
> into one op may be the best solution without the extra complexity of
> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
> time, if we have a way of marking mbufs as ready for consumption. The
> enqueuer may not have all the data at hand but can enqueue the op with a
> couple of empty mbufs marked as not ready for consumption. The enqueuer
> will then update the rest of the mbufs to ready for consumption once the
> data is added. This introduces a race condition. A second flag for each
> mbuf can be updated by the PMD to indicate that it processed it or not.
> This way in cases where the PMD beat the application to the op, the
> application will just update the op to point to the first unprocessed
> mbuf and resend it to the PMD.
> >>
> >>>>> Here an op *must* be attached to a stream and every subsequent
> >>>>> enqueue_burst() call should carry *same* stream. Since PMD maintain ops
> >>>>> state in stream, thus it is mandatory for application to attach stream
> >>>>> to such ops.
> >> [Fiona] I think you're referring only to a single stream above, but as
> >> there may be many different streams, maybe add the following?
> >> Above is simplified to show just a single stream. However there may be
> >> many streams, and each enqueue_burst() may contain ops from different
> >> streams, as long as there is only one op in-flight from any stream at a
> >> given time.
> >>
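The rule above (a burst may mix streams, but at most one op per stream is in flight) can be enforced on the application side with a per-stream flag. A minimal sketch, assuming hypothetical `app_stream`/`app_op` types in place of the proposed rte_comp structures; none of these names are DPDK API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application-side view of a stream; not a DPDK type. */
struct app_stream {
    int id;
    bool in_flight;   /* true while an op for this stream is enqueued */
};

/* Hypothetical op: only the field needed for the guard is modelled. */
struct app_op {
    struct app_stream *stream;
};

/*
 * Build a burst from candidate ops, admitting at most one op per stream
 * and skipping streams that already have an op in flight.
 * Returns the number of op pointers written to burst[].
 */
size_t build_burst(struct app_op *cand, size_t n_cand,
                   struct app_op **burst, size_t max_burst)
{
    size_t n = 0;
    for (size_t i = 0; i < n_cand && n < max_burst; i++) {
        struct app_stream *s = cand[i].stream;
        assert(s != NULL);      /* stateful ops must carry a stream */
        if (s->in_flight)
            continue;           /* previous op of this stream not dequeued */
        s->in_flight = true;
        burst[n++] = &cand[i];
    }
    return n;
}

/* Call for each dequeued op so its stream can accept the next chunk. */
void on_dequeue(struct app_op *op)
{
    op->stream->in_flight = false;
}
```

In this sketch `on_dequeue()` would be called for every op returned by the dequeue call, so the stream's next chunk becomes eligible for a later burst.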
> > [Shally] Ok get it.
> >
> >>>>> D.2.2 Stateful and Out_of_Space
> >>>>> --------------------------------------------
> >>>>> If PMD support stateful and run into OUT_OF_SPACE situation, then it is
> >>>>> not an error condition for PMD. In such case, PMD return with status
> >>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with consumed = number of input
> >>>>> bytes read and produced = length of complete output buffer.
> >> [Fiona] - produced would be <= output buffer len (typically =, but could
> >> be a few bytes less)
> >>
> >>
> >>>>> Application should enqueue op with source starting at consumed+1 and
> >>>>> output buffer with available space.
> >>>>
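A minimal sketch of the stateful OUT_OF_SPACE handling described above. It assumes a mock PMD that simply copies input to output (so no real DEFLATE history is modelled), and every name in it is hypothetical. Note that "consumed+1" counts bytes from 1; with 0-based offsets the resubmitted op starts at `src + consumed`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the proposed status and flush values. */
enum op_status { ST_SUCCESS, ST_OUT_OF_SPACE };
enum flush { FLUSH_NONE, FLUSH_FINAL };

struct mock_op {
    const unsigned char *src; size_t src_len;
    unsigned char *dst;       size_t dst_len;
    enum flush flush;           /* set by the app; ignored by the mock */
    size_t consumed, produced;  /* filled in by the "PMD" */
    enum op_status status;
};

/* Mock stateful PMD: an identity transform (copy), so produced equals
 * consumed. No real compression state is modelled. */
static void mock_process(struct mock_op *op)
{
    size_t n = op->src_len < op->dst_len ? op->src_len : op->dst_len;
    memcpy(op->dst, op->src, n);
    op->consumed = n;
    op->produced = n;
    op->status = (n < op->src_len) ? ST_OUT_OF_SPACE : ST_SUCCESS;
}

/* Process one chunk of a stream through a fixed-size dst buffer, draining
 * output into out[] and resubmitting the unconsumed input on OUT_OF_SPACE.
 * 'last' selects FLUSH_FINAL for the final chunk of the stream.
 * Returns the total number of bytes written to out[]. */
size_t process_chunk(const unsigned char *chunk, size_t len, int last,
                     unsigned char *out, size_t dst_size)
{
    unsigned char dst[64];
    size_t off = 0, total = 0;

    assert(dst_size > 0 && dst_size <= sizeof(dst));
    for (;;) {
        struct mock_op op = {
            .src = chunk + off, .src_len = len - off,
            .dst = dst, .dst_len = dst_size,
            .flush = last ? FLUSH_FINAL : FLUSH_NONE,
        };
        mock_process(&op);             /* stands in for enqueue + dequeue */
        memcpy(out + total, dst, op.produced);
        total += op.produced;
        off += op.consumed;            /* next op starts at src[consumed] */
        if (op.status != ST_OUT_OF_SPACE)
            return total;              /* chunk fully consumed */
    }
}
```

With a 4-byte output buffer and 10 bytes of input, the loop above submits three ops (4 + 4 + 2 bytes), each starting where the previous one stopped consuming.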
> >>>> [Ahmed] Related to OUT_OF_SPACE. What status does the user receive in a
> >>>> decompression case when the end block is encountered before the end of
> >>>> the input? Does the PMD continue decomp? Does it stop there and return
> >>>> the stop index?
> >>>>
> >>> [Shally] Before I could answer this, please help me understand your use
> >>> case. When you say "when the end block is encountered before the end of
> >>> the input?" Do you mean -
> >>> "Decompressor process a final block (i.e. has BFINAL=1 in its header) and
> >>> there's some footer data after that?" Or
> >>> you mean "decompressor process one block and has more to process till its
> >>> final block?"
> >>> What is "end block" and "end of input" reference here?
> [Ahmed] I meant BFINAL=1 by end block. The end of input is the end of
> the input length.
> e.g.
> | input length------------------------------------------------------|
> |--data----data----data------data-------BFINAL-footer-unrelated data|

[Shally] I will respond to this with my understanding, and will wait for Fiona to respond first on the rest of the comments above.

So, if the decompressor encounters a final block before the end of the actual input, then it should ideally continue to decompress the final block and consume input until it sees its end-of-block marker.
Normally a decompressor does not process data after it has finished processing the final block, so the unprocessed trailing data may be passed back to the application as is, with
'consumed = length of input up to the end of the final block' and 'status = SUCCESS/Out_of_space' (Out_of_space here implies the output buffer ran out of space while decompressed data was being written to it).
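With this behaviour, the application can locate the unprocessed trailing bytes (for example a gzip or zlib trailer, or unrelated data) from `consumed` alone. A sketch, with a hypothetical `done_op` structure standing in for the completed op:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of a completed decompression op; not a DPDK type. */
struct done_op {
    const unsigned char *src;
    size_t src_len;   /* length the application submitted */
    size_t consumed;  /* bytes read up to the end of the BFINAL block */
};

/* Return a pointer to the bytes after the final DEFLATE block and store
 * their count in *len; returns NULL when no trailing data is left. */
const unsigned char *trailing_data(const struct done_op *op, size_t *len)
{
    assert(op->consumed <= op->src_len);
    *len = op->src_len - op->consumed;
    return *len ? op->src + op->consumed : NULL;
}
```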

Thanks
Shally
> >>>
> >>>>> D.2.3 Sliding Window Size
> >>>>> ------------------------------------
> >>>>> Every PMD will reflect in its algorithm capability structure maximum
> >>>>> length of Sliding Window in bytes which would indicate maximum history
> >>>>> buffer length used by algo.
> >>>>> 2. Example API illustration
> >>>>> ~~~~~~~~~~~~~~~~~~~~~~~
> >>>>>
> >> [Fiona] I think it would be useful to show an example of both a STATELESS
> >> flow and a STATEFUL flow.
> >>
> > [Shally] Ok. I can add simplified version to illustrate API usage in both cases.
> >
> >>>>> Following is an illustration on API usage (This is just one flow, other
> >>>>> variants are also possible):
> >>>>> 1. rte_comp_session *sess = rte_compressdev_session_create
> >>>>>    (rte_mempool *pool);
> >>>>> 2. rte_compressdev_session_init (int dev_id, rte_comp_session *sess,
> >>>>>    rte_comp_xform *xform, rte_mempool *sess_pool);
> >>>>> 3. rte_comp_op_pool_create(rte_mempool ..)
> >>>>> 4. rte_comp_op_bulk_alloc (struct rte_mempool *mempool, struct
> >>>>>    rte_comp_op **ops, uint16_t nb_ops);
> >>>>> 5. for every rte_comp_op in ops[],
> >>>>>     5.1 rte_comp_op_attach_session (rte_comp_op *op,
> >>>>>         rte_comp_session *sess);
> >>>>>     5.2 op.op_type = RTE_COMP_OP_STATELESS
> >>>>>     5.3 op.flush = RTE_FLUSH_FINAL
> >>>>> 6. [Optional] for every rte_comp_op in ops[],
> >>>>>     6.1 rte_comp_stream_create(int dev_id, rte_comp_session *sess,
> >>>>>         void **stream);
> >>>>>     6.2 rte_comp_op_attach_stream(rte_comp_op *op, void *stream);
> >>>>
> >>>> [Ahmed] What is the semantic effect of attaching a stream to every op?
> >>>> Will this application benefit from this, given that it is setup with
> >>>> op_type STATELESS?
> >>> [Shally] By role, stream is data structure that hold all information that
> >>> PMD need to maintain for an op processing and thus it's marked device
> >>> specific. It is required for stateful processing but optional for
> >>> stateless as PMD doesn't need to maintain context once op is processed
> >>> unlike stateful.
> >>> It may be of advantage to use stream for stateless to some of the PMD.
> >>> They can be designed to do one-time per op setup (such as mapping session
> >>> params) during stream_create() in control path rather than data path.
> >>>
> >> [Fiona] yes, we agreed that stream_create() should be called for every
> >> session and if it returns non-NULL the PMD needs it so op_attach_stream()
> >> must be called.
> >> However I've just realised we don't have a STATEFUL/STATELESS param on
> >> the xform, just on the op.
> >> So we could either add stateful/stateless param to stream_create() ?
> >> OR add stateful/stateless param to xform so it would be in session?
> > [Shally] No it shouldn't be as part of session or xform as sessions aren't
> > stateless/stateful.
> > So, we shouldn't alter the current definition of session or xforms.
> > If we need to mention it, then it could be added as part of stream_create()
> > as it's device specific and depending upon op_type() device can then setup
> > stream resources.
> >
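A sketch of the rule Fiona describes (always call stream_create() and attach whatever non-NULL handle it returns), combined with the op_type argument Shally suggests for stream_create(). That argument is only a proposal at this point in the thread, and the mock functions and types below are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum op_type { OP_STATELESS, OP_STATEFUL };

struct mock_op {
    enum op_type type;
    void *stream;   /* PMD-private handle, or NULL if the PMD needs none */
};

/* Mock PMD (hypothetical): needs private state only for stateful ops.
 * A PMD using the stateless optimisation discussed above could return a
 * non-NULL handle for OP_STATELESS as well. */
static void *mock_stream_create(enum op_type type)
{
    return type == OP_STATEFUL ? calloc(1, 256) : NULL;
}

/* Application side: always call stream_create() and attach the result. */
int setup_op(struct mock_op *op, enum op_type type)
{
    assert(op != NULL);
    op->type = type;
    op->stream = mock_stream_create(type);
    if (type == OP_STATEFUL && op->stream == NULL)
        return -1;            /* allocation failed; cannot run stateful */
    return 0;
}

void teardown_op(struct mock_op *op)
{
    free(op->stream);         /* free(NULL) is a no-op */
    op->stream = NULL;
}
```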
> >> However, Shally, can you reconsider if you really need it for STATELESS or
> >> if the data you want to store there could be stored in the session? Or if
> >> it's needed per-op does it really need to be visible on the API as a
> >> stream or could it be hidden within the PMD?
> > [Shally] I would say it is not mandatory but a desirable feature that I am
> > suggesting.
> > I am only trying to enable optimization in data path which may be of help
> > to some PMD designs as they can use stream_create() to do setup that are
> > 1-time per op and regardless of op_type, such as I mentioned, setting up
> > user session params to device sess params.
> > We can hide it inside PMD however there may be slight overhead in
> > datapath depending on PMD design.
> > But I would say, it's not a blocker for us to freeze on current spec. We
> > can revisit this feature later because it will not alter base API
> > functionality.
> >
> > Thanks
> > Shally
> >
> >>>>> 7. for every rte_comp_op in ops[],
> >>>>>      7.1 set up with src/dst buffer
> >>>>> 8. enq = rte_compressdev_enqueue_burst (dev_id, qp_id, ops, nb_ops);
> >>>>> 9. dqu = 0; do while (dqu < enq) // Wait till all of enqueued are dequeued
> >>>>>     9.1 dqu += rte_compressdev_dequeue_burst (dev_id, qp_id,
> >>>>>         &ops[dqu], enq - dqu);
> >>>> [Ahmed] I am assuming that waiting for all enqueued to be dequeued is not
> >>>> strictly necessary, but is just the chosen example in this case
> >>>>
> >>> [Shally] Yes. By design, for burst_size>1 each op is independent of each
> >>> other. So app may proceed as soon as it dequeues any.
> >>>
> >>>>> 10. Repeat 7 for next batch of data
> >>>>> 11. for every op in ops[]
> >>>>>       11.1 rte_comp_stream_free(op->stream);
> >>>>> 12. rte_comp_session_clear (sess) ;
> >>>>> 13. rte_comp_session_terminate(rte_comp_session *session)
> >>>>>
> >>>>> Thanks
> >>>>> Shally
> >>>>>
> >>>>>
> >
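Step 9 of the illustration only works if the dequeue count is accumulated and each call writes into the next free slot, instead of `dqu` being overwritten by each call. A sketch of that loop against a mock queue pair that returns at most two ops per dequeue (the `mock_*` functions are hypothetical, not the proposed rte_compressdev calls):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_OPS 32

struct mock_op { int id; };

/* Mock queue pair (hypothetical): a linear FIFO; dequeue returns at most
 * two ops per call to mimic a device that completes work gradually. */
static struct mock_op *qp[MAX_OPS];
static uint16_t qp_head, qp_tail;

static uint16_t mock_enqueue_burst(struct mock_op **ops, uint16_t nb_ops)
{
    uint16_t n = 0;
    if (qp_head == qp_tail)
        qp_head = qp_tail = 0;        /* queue empty: rewind */
    while (n < nb_ops && qp_tail < MAX_OPS)
        qp[qp_tail++] = ops[n++];
    return n;
}

static uint16_t mock_dequeue_burst(struct mock_op **ops, uint16_t nb_ops)
{
    uint16_t n = 0;
    while (n < nb_ops && n < 2 && qp_head < qp_tail)
        ops[n++] = qp[qp_head++];
    return n;
}

/* Steps 8-9: enqueue, then accumulate dequeues into successive slots of
 * done[] until every enqueued op has come back. */
uint16_t run_batch(struct mock_op **ops, struct mock_op **done,
                   uint16_t nb_ops)
{
    uint16_t enq = mock_enqueue_burst(ops, nb_ops);
    uint16_t dqu = 0;

    assert(nb_ops <= MAX_OPS);
    while (dqu < enq)
        dqu += mock_dequeue_burst(&done[dqu], (uint16_t)(enq - dqu));
    return dqu;
}
```

As Shally notes above, waiting for the whole batch is only this example's choice; an application may act on each op as soon as it is dequeued.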

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-01-25 18:19         ` Ahmed Mansour
  2018-01-29 12:47           ` Verma, Shally
@ 2018-01-31 19:03           ` Trahe, Fiona
  2018-02-01  5:40             ` Verma, Shally
  2018-02-01 20:23             ` Ahmed Mansour
  1 sibling, 2 replies; 30+ messages in thread
From: Trahe, Fiona @ 2018-01-31 19:03 UTC (permalink / raw)
  To: Ahmed Mansour, Verma, Shally, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry, Trahe, Fiona

Hi Ahmed, Shally,

///snip///
> >>>>> D.1.1 Stateless and OUT_OF_SPACE
> >>>>> ------------------------------------------------
> >>>>> OUT_OF_SPACE is a condition when output buffer runs out of space and
> >>>>> where PMD still has more data to produce. If PMD run into such
> >>>>> condition, then it's an error condition in stateless processing.
> >>>>> In such case, PMD resets itself and return with status
> >>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0 i.e.
> >>>>> no input read, no output written.
> >>>>> Application can resubmit the full input with larger output buffer size.
> >>>> [Ahmed] Can we add an option to allow the user to read the data that
> >>>> was produced while still reporting OUT_OF_SPACE? This is mainly useful
> >>>> for decompression applications doing search.
> >>> [Shally] It is there but applicable for stateful operation type (please refer
> >>> to handling out_of_space under "Stateful Section").
> >>> By definition, "stateless" here means that application (such as IPCOMP)
> >>> knows maximum output size guaranteedly and ensure that uncompressed data
> >>> size cannot grow more than provided output buffer.
> >>> Such apps can submit an op with type = STATELESS and provide full input,
> >>> then PMD assume it has sufficient input and output and thus doesn't need
> >>> to maintain any contexts after op is processed.
> >>> If application doesn't know about max output size, then it should process
> >>> it as stateful op i.e. setup op with type = STATEFUL and attach a stream
> >>> so that PMD can maintain relevant context to handle such condition.
> >> [Fiona] There may be an alternative that's useful for Ahmed, while still
> >> respecting the stateless concept.
> >> In Stateless case where a PMD reports OUT_OF_SPACE in decompression case
> >> it could also return consumed=0, produced = x, where x>0. X indicates the
> >> amount of valid data which has been written to the output buffer. It is not
> >> complete, but if an application wants to search it it may be sufficient.
> >> If the application still wants the data it must resubmit the whole input
> >> with a bigger output buffer, and decompression will be repeated from the
> >> start, it cannot expect to continue on as the PMD has not maintained state,
> >> history or data.
> >> I don't think there would be any need to indicate this in capabilities, PMDs
> >> which cannot provide this functionality would always return
> >> produced=consumed=0, while PMDs which can could set produced > 0.
> >> If this works for you both, we could consider a similar case for compression.
> >>
> > [Shally] Sounds Fine to me. Though then in that case, consumed should also
> > be updated to actual consumed by PMD.
> > Setting consumed = 0 with produced > 0 doesn't correlate.
> [Ahmed] I like Fiona's suggestion, but I also do not like the implication
> of returning consumed = 0. At the same time returning consumed = y
> implies to the user that it can proceed from the middle. I prefer the
> consumed = 0 implementation, but I think a different return is needed to
> distinguish it from OUT_OF_SPACE that the user can recover from. Perhaps
> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
> future PMD implementations to provide recover-ability even in STATELESS
> mode if they so wish. In this model STATELESS or STATEFUL would be a
> hint for the PMD implementation to make optimizations for each case, but
> it does not force the PMD implementation to limit functionality if it
> can provide recover-ability.
[Fiona] So you're suggesting the following:
OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
    can be used and next op in stream should continue on from op.consumed+1.
OUT_OF_SPACE_TERMINATED - returned only on stateless operation. 
    Error condition, no recovery possible.
    consumed=produced=0. Application must resubmit all input data with
    a bigger output buffer.
OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
     - consumed = 0, produced > 0. Application must resubmit all input data with
        a bigger output buffer. However in decompression case, data up to produced 
        in dst buffer may be inspected/searched. Never happens in compression 
        case as output data would be meaningless.
     - consumed > 0, produced > 0. PMD has stored relevant state and history and so
        can convert to stateful, using op.produced and continuing from consumed+1. 
I don't expect our PMDs to use this last case, but maybe this works for others?
I'm not convinced it's not just adding complexity. It sounds like a version of stateful
without a stream, and maybe less efficient?
If so, should it respect the FLUSH flag, which would have been FULL or FINAL in the op,
or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
simply have submitted a STATEFUL request if this is the behaviour it wants.
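To make the proposed taxonomy concrete, here is a sketch of how an application might map each status to its next step. The enum names are hypothetical stand-ins for the proposed RTE_COMP_OP_STATUS_* values, and the `consumed > 0` RECOVERABLE branch is the case flagged above as unlikely to be used:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes mirroring the proposal above. */
enum comp_status {
    COMP_SUCCESS,
    COMP_OUT_OF_SPACE,             /* stateful: continue from consumed */
    COMP_OUT_OF_SPACE_TERMINATED,  /* stateless: nothing usable */
    COMP_OUT_OF_SPACE_RECOVERABLE  /* stateless: partial result */
};

enum app_action {
    ACT_DONE,
    ACT_CONTINUE_STREAM,       /* next op starts at src + consumed */
    ACT_RESUBMIT_ALL,          /* whole input again, bigger dst */
    ACT_INSPECT_THEN_RESUBMIT  /* search dst[0..produced), then resubmit */
};

enum app_action next_action(enum comp_status st, size_t consumed,
                            size_t produced, int decompress)
{
    switch (st) {
    case COMP_SUCCESS:
        return ACT_DONE;
    case COMP_OUT_OF_SPACE:
        return ACT_CONTINUE_STREAM;
    case COMP_OUT_OF_SPACE_TERMINATED:
        return ACT_RESUBMIT_ALL;
    case COMP_OUT_OF_SPACE_RECOVERABLE:
        if (consumed > 0)
            return ACT_CONTINUE_STREAM;  /* PMD kept state and history */
        /* consumed == 0: partial output is only meaningful when
         * decompressing */
        return (decompress && produced > 0) ? ACT_INSPECT_THEN_RESUBMIT
                                            : ACT_RESUBMIT_ALL;
    }
    return ACT_RESUBMIT_ALL;  /* unknown status: treat as unrecoverable */
}
```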
 



> >
> >>>>> D.2 Compression API Stateful operation
> >>>>> ----------------------------------------------------------
> >>>>>  A Stateful operation in DPDK compression means application invokes
> >>>>> enqueue_burst() multiple times to process related chunk of data either because
> >>>>> - Application broke data into several ops, and/or
> >>>>> - PMD ran into out_of_space situation during input processing
> >>>>>
> >>>>> In case of either one or all of the above conditions, PMD is required to
> >>>>> maintain state of op across enqueue_burst() calls and
> >>>>> ops are setup with op_type RTE_COMP_OP_STATEFUL, and begin with
> >>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and end at flush value
> >>>>> RTE_COMP_FULL/FINAL_FLUSH.
> >>>>> D.2.1 Stateful operation state maintenance
> >>>>> ---------------------------------------------------------------
> >>>>> It is always an ideal expectation from application that it should parse
> >>>>> through all related chunk of source data making its mbuf-chain and
> >>>>> enqueue it for stateless processing.
> >>>>> However, if it need to break it into several enqueue_burst() calls, then
> >>>>> an expected call flow would be something like:
> >>>>> enqueue_burst( |op.no_flush |)
> >>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
> >>>> burst in a loop until all ops are received. Is this correct?
> >>>>
> >>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>> [Shally] Yes. Ideally every submitted op need to be dequeued. However
> >>> this illustration is specifically in context of stateful op processing
> >>> to reflect if a stream is broken into chunks, then each chunk should be
> >>> submitted as one op at-a-time with type = STATEFUL and need to be
> >>> dequeued first before next chunk is enqueued.
> >>>
> >>>>> enqueue_burst( |op.no_flush |)
> >>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>> enqueue_burst( |op.full_flush |)
> >>>> [Ahmed] Why not allow multiple work items in flight? I understand that
> >>>> occasionally there will be OUT_OF_SPACE exception. Can we just
> >>>> distinguish the response in exception cases?
> >>> [Shally] Multiple ops are allowed in flight, however condition is each op
> >>> in such case is independent of each other i.e. belong to different
> >>> streams altogether.
> >>> Earlier (as part of RFC v1 doc) we did consider the proposal to process
> >>> all related chunks of data in single burst by passing them as ops array
> >>> but later found that as not-so-useful for PMD handling for various
> >>> reasons. You may please refer to RFC v1 doc review comments for same.
> >> [Fiona] Agree with Shally. In summary, as only one op can be processed at
> >> a time, since each needs the state of the previous, to allow more than 1
> >> op to be in-flight at a time would force PMDs to implement internal
> >> queueing and exception handling for OUT_OF_SPACE conditions you mention.
> [Ahmed] But we are putting the ops on qps which would make them
> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
> complex but doable. 
[Fiona] In my opinion this is not doable, or at the very least could be very inefficient.
There may be many streams.
The PMD would have to have an internal queue per stream so
it could adjust the next src offset and length in the OUT_OF_SPACE case.
And this may ripple back through all subsequent ops in the stream, as each
source length is increased and its dst buffer is then not big enough.

> The question is this mode of use useful for real
> life applications or would we be just adding complexity? The technical
> advantage of this is that processing of Stateful ops is interdependent
> and PMDs can take advantage of caching and other optimizations to make
> processing related ops much faster than switching on every op. PMDs have
> to maintain state of more than 32 KB for DEFLATE for every stream.
> >> If the application has all the data, it can put it into chained mbufs in a
> >> single op rather than multiple ops, which avoids pushing all that
> >> complexity down to the PMDs.
> [Ahmed] I think that your suggested scheme of putting all related mbufs
> into one op may be the best solution without the extra complexity of
> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
> time, if we have a way of marking mbufs as ready for consumption. The
> enqueuer may not have all the data at hand but can enqueue the op with a
> couple of empty mbufs marked as not ready for consumption. The enqueuer
> will then update the rest of the mbufs to ready for consumption once the
> data is added. This introduces a race condition. A second flag for each
> mbuf can be updated by the PMD to indicate that it processed it or not.
> This way in cases where the PMD beat the application to the op, the
> application will just update the op to point to the first unprocessed
> mbuf and resend it to the PMD.
[Fiona] This doesn't sound safe. You want to add data to a stream after you've
enqueued the op. You would have to write to op.src.length at a time when the PMD
might be reading it. Sounds like a lock would be necessary.
Once the op has been enqueued, my understanding is its ownership is handed
over to the PMD and the application should not touch it until it has been dequeued.
I don't think it's a good idea to change this model.
Can't the application just collect a stream of data in chained mbufs until it has
enough to send an op, then construct the op and while waiting for that op to
complete, accumulate the next batch of chained mbufs? Only construct the next op
after the previous one is complete, based on the result of the previous one.   
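The accumulate-while-in-flight pattern suggested above can be sketched with two buffers, so the application never writes to data owned by an in-flight op. Assumptions: flat byte buffers stand in for chained mbufs, the PMD is modelled only by an in-flight flag, and all names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BUF_CAP 256

/* Two buffers: one owned by the (mock) PMD while an op is in flight, one
 * the application keeps filling. Hypothetical, not DPDK mbufs. */
struct pipeline {
    unsigned char buf[2][BUF_CAP];
    size_t len[2];
    int fill;          /* index of the buffer the application appends to */
    bool in_flight;    /* op built from buf[!fill] is with the PMD */
    size_t submitted;  /* total bytes handed to the PMD so far */
};

/* Application receives more data: append only to the fill buffer. */
bool pl_append(struct pipeline *p, const void *data, size_t n)
{
    if (p->len[p->fill] + n > BUF_CAP)
        return false;                  /* caller must wait for a swap */
    memcpy(p->buf[p->fill] + p->len[p->fill], data, n);
    p->len[p->fill] += n;
    return true;
}

/* If nothing is in flight, turn the accumulated data into the next op and
 * swap buffers. Returns true if an op was "enqueued". */
bool pl_try_submit(struct pipeline *p)
{
    assert(p != NULL);
    if (p->in_flight || p->len[p->fill] == 0)
        return false;
    p->in_flight = true;
    p->submitted += p->len[p->fill];
    p->fill ^= 1;                      /* keep accumulating elsewhere */
    p->len[p->fill] = 0;
    return true;
}

/* Dequeue of the in-flight op. A real application would also check the
 * result (e.g. a short 'consumed') before building the next op. */
void pl_complete(struct pipeline *p)
{
    p->in_flight = false;
}
```

Swapping buffers instead of appending to the enqueued one is what removes the need for the lock discussed above.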


> >>>> [Ahmed] Related to OUT_OF_SPACE. What status does the user receive in a
> >>>> decompression case when the end block is encountered before the end of
> >>>> the input? Does the PMD continue decomp? Does it stop there and return
> >>>> the stop index?
> >>>>
> >>> [Shally] Before I could answer this, please help me understand your use
> >>> case. When you say "when the end block is encountered before the end of
> >>> the input?" Do you mean -
> >>> "Decompressor process a final block (i.e. has BFINAL=1 in its header) and
> >>> there's some footer data after that?" Or
> >>> you mean "decompressor process one block and has more to process till its
> >>> final block?"
> >>> What is "end block" and "end of input" reference here?
> [Ahmed] I meant BFINAL=1 by end block. The end of input is the end of
> the input length.
> e.g.
> | input length------------------------------------------------------|
> |--data----data----data------data-------BFINAL-footer-unrelated data|
> >>>
[Fiona] I propose if BFINAL bit is detected before end of input
the decompression should stop. In this case consumed will be < src.length.
produced will be < dst buffer size. Do we need an extra STATUS response?
STATUS_BFINAL_DETECTED  ?
The only thing I don't like about this is that it can impact performance, as normally
we can just look for STATUS == SUCCESS. Anything else should be an exception.
Now the application would have to check for SUCCESS || BFINAL_DETECTED every time.
Do you have a suggestion on how we should handle this?
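One way to keep the common case to a single comparison is to handle BFINAL_DETECTED as an exception, off the fast path. A sketch assuming that status is rare for the application's traffic, with hypothetical status names and the GCC/Clang `__builtin_expect` builtin standing in for DPDK's `likely()`:

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang branch hint, similar in spirit to DPDK's likely(). */
#define APP_LIKELY(x) __builtin_expect(!!(x), 1)

/* Hypothetical status values; BFINAL_DETECTED is only a proposal. */
enum comp_status { ST_SUCCESS = 0, ST_BFINAL_DETECTED, ST_OUT_OF_SPACE, ST_ERROR };

struct res { enum comp_status status; size_t consumed, src_len; };

/* Returns the number of trailing (unprocessed) input bytes, or -1 for any
 * other status, which is left to separate handling. The common case costs
 * a single comparison; BFINAL_DETECTED is handled off the fast path. */
long handle_result(const struct res *r)
{
    if (APP_LIKELY(r->status == ST_SUCCESS))
        return 0;                      /* all input consumed */
    switch (r->status) {
    case ST_BFINAL_DETECTED:
        assert(r->consumed <= r->src_len);
        return (long)(r->src_len - r->consumed);  /* footer/unrelated data */
    default:
        return -1;
    }
}
```

If most inputs carry trailers (gzip-framed data, for example), this branch is no longer rare and the performance concern above still applies.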
  

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-01-31 19:03           ` Trahe, Fiona
@ 2018-02-01  5:40             ` Verma, Shally
  2018-02-01 11:54               ` Trahe, Fiona
  2018-02-01 20:23             ` Ahmed Mansour
  1 sibling, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-02-01  5:40 UTC (permalink / raw)
  To: Trahe, Fiona, Ahmed Mansour, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry



>-----Original Message-----
>From: Trahe, Fiona [mailto:fiona.trahe@intel.com]
>Sent: 01 February 2018 00:33
>To: Ahmed Mansour <ahmed.mansour@nxp.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>
>Subject: RE: [RFC v2] doc compression API for DPDK
>
>Hi Ahmed, Shally,
>
>///snip///
>> >>>>> D.1.1 Stateless and OUT_OF_SPACE
>> >>>>> ------------------------------------------------
>> >>>>> OUT_OF_SPACE is a condition when output buffer runs out of space and
>> >>>>> where PMD still has more data to produce. If PMD run into such
>> >>>>> condition, then it's an error condition in stateless processing.
>> >>>>> In such case, PMD resets itself and return with status
>> >>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0 i.e.
>> >>>>> no input read, no output written.
>> >>>>> Application can resubmit the full input with larger output buffer size.
>> >>>> [Ahmed] Can we add an option to allow the user to read the data that
>> >>>> was produced while still reporting OUT_OF_SPACE? This is mainly useful
>> >>>> for decompression applications doing search.
>> >>> [Shally] It is there but applicable for stateful operation type (please refer
>> >>> to handling out_of_space under "Stateful Section").
>> >>> By definition, "stateless" here means that application (such as IPCOMP)
>> >>> knows maximum output size guaranteedly and ensure that uncompressed data
>> >>> size cannot grow more than provided output buffer.
>> >>> Such apps can submit an op with type = STATELESS and provide full input,
>> >>> then PMD assume it has sufficient input and output and thus doesn't need
>> >>> to maintain any contexts after op is processed.
>> >>> If application doesn't know about max output size, then it should process
>> >>> it as stateful op i.e. setup op with type = STATEFUL and attach a stream
>> >>> so that PMD can maintain relevant context to handle such condition.
>> >> [Fiona] There may be an alternative that's useful for Ahmed, while still
>> >> respecting the stateless concept.
>> >> In Stateless case where a PMD reports OUT_OF_SPACE in decompression case
>> >> it could also return consumed=0, produced = x, where x>0. X indicates the
>> >> amount of valid data which has been written to the output buffer. It is not
>> >> complete, but if an application wants to search it it may be sufficient.
>> >> If the application still wants the data it must resubmit the whole input
>> >> with a bigger output buffer, and decompression will be repeated from the
>> >> start, it cannot expect to continue on as the PMD has not maintained state,
>> >> history or data.
>> >> I don't think there would be any need to indicate this in capabilities, PMDs
>> >> which cannot provide this functionality would always return
>> >> produced=consumed=0, while PMDs which can could set produced > 0.
>> >> If this works for you both, we could consider a similar case for compression.
>> >>
>> > [Shally] Sounds Fine to me. Though then in that case, consumed should also
>> > be updated to actual consumed by PMD.
>> > Setting consumed = 0 with produced > 0 doesn't correlate.
>> [Ahmed] I like Fiona's suggestion, but I also do not like the implication
>> of returning consumed = 0. At the same time returning consumed = y
>> implies to the user that it can proceed from the middle. I prefer the
>> consumed = 0 implementation, but I think a different return is needed to
>> distinguish it from OUT_OF_SPACE that the user can recover from. Perhaps
>> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
>> future PMD implementations to provide recover-ability even in STATELESS
>> mode if they so wish. In this model STATELESS or STATEFUL would be a
>> hint for the PMD implementation to make optimizations for each case, but
>> it does not force the PMD implementation to limit functionality if it
>> can provide recover-ability.
>[Fiona] So you're suggesting the following:
>OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
>    can be used and next op in stream should continue on from op.consumed+1.
>OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
>    Error condition, no recovery possible.
>    consumed=produced=0. Application must resubmit all input data with
>    a bigger output buffer.
>OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
>     - consumed = 0, produced > 0. Application must resubmit all input data with
>        a bigger output buffer. However in decompression case, data up to produced
>        in dst buffer may be inspected/searched. Never happens in compression
>        case as output data would be meaningless.
>     - consumed > 0, produced > 0. PMD has stored relevant state and history and so
>        can convert to stateful, using op.produced and continuing from consumed+1.
>I don't expect our PMDs to use this last case, but maybe this works for others?
>I'm not convinced it's not just adding complexity. It sounds like a version of stateful
>without a stream, and maybe less efficient?
>If so, should it respect the FLUSH flag, which would have been FULL or FINAL in the op?
>Or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
>simply have submitted a STATEFUL request if this is the behaviour it wants?
>
>
>
>
>> >
>> >>>>> D.2 Compression API Stateful operation
>> >>>>> ----------------------------------------------------------
>> >>>>> A Stateful operation in DPDK compression means the application invokes
>> >>>>> enqueue_burst() multiple times to process related chunks of data, either
>> >>>>> because
>> >>>>> - the application broke the data into several ops, and/or
>> >>>>> - the PMD ran into an out_of_space situation during input processing.
>> >>>>>
>> >>>>> In either or both of the above cases, the PMD is required to
>> >>>>> maintain the state of the op across enqueue_burst() calls, and
>> >>>>> ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with
>> >>>>> flush value RTE_COMP_NO/SYNC_FLUSH and ending with flush value
>> >>>>> RTE_COMP_FULL/FINAL_FLUSH.
>> >>>>> D.2.1 Stateful operation state maintenance
>> >>>>> ---------------------------------------------------------------
>> >>>>> Ideally, the application should parse through all related chunks of
>> >>>>> source data, building an mbuf chain, and enqueue it for stateless processing.
>> >>>>> However, if it needs to break the data into several enqueue_burst() calls, then
>> >>>>> the expected call flow would be something like:
>> >>>>> enqueue_burst( |op.no_flush |)
>> >>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
>> >>>> burst in a loop until all ops are received. Is this correct?
>> >>>>
>> >>>>> dequeue_burst(op) // should dequeue before we enqueue next
>> >>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However,
>> >>> this illustration is specifically in the context of stateful op processing:
>> >>> if a stream is broken into chunks, then each chunk should be submitted as
>> >>> one op at a time with type = STATEFUL, and it needs to be dequeued before
>> >>> the next chunk is enqueued.
>> >>>
>> >>>>> enqueue_burst( |op.no_flush |)
>> >>>>> dequeue_burst(op) // should dequeue before we enqueue next
>> >>>>> enqueue_burst( |op.full_flush |)
>> >>>> [Ahmed] Why not allow multiple work items in flight? I understand that
>> >>>> occasionally there will be an OUT_OF_SPACE exception. Can we just
>> >>>> distinguish the response in exception cases?
>> >>> [Shally] Multiple ops are allowed in flight; however, the condition is that
>> >>> such ops are independent of each other, i.e. they belong to different
>> >>> streams altogether.
>> >>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process
>> >>> all related chunks of data in a single burst by passing them as an ops array,
>> >>> but later found that not so useful for PMD handling, for various
>> >>> reasons. Please refer to the RFC v1 doc review comments for the same.
>> >> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
>> >> time, since each needs the state of the previous one, allowing more than one
>> >> op to be in flight at a time would force PMDs to implement internal queueing
>> >> and exception handling for the OUT_OF_SPACE conditions you mention.
>> [Ahmed] But we are putting the ops on qps, which would make them
>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>> complex, but doable.
>[Fiona] In my opinion this is not doable; it could be very inefficient.
>There may be many streams.
>The PMD would have to have an internal queue per stream so
>it could adjust the next src offset and length in the OUT_OF_SPACE case.
>And this may ripple back through all subsequent ops in the stream as each
>source len is increased and its dst buffer is not big enough.
>
>> The question is whether this mode of use is useful for real-life
>> applications, or whether we would just be adding complexity. The technical
>> advantage of this is that processing of Stateful ops is interdependent,
>> and PMDs can take advantage of caching and other optimizations to make
>> processing related ops much faster than switching on every op. PMDs have to
>> maintain more than 32 KB of state for DEFLATE for every stream.
>> >> If the application has all the data, it can put it into chained mbufs in a single
>> >> op rather than multiple ops, which avoids pushing all that complexity down to the PMDs.
>> [Ahmed] I think that your suggested scheme of putting all related mbufs
>> into one op may be the best solution without the extra complexity of
>> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
>> time, if we have a way of marking mbufs as ready for consumption. The
>> enqueuer may not have all the data at hand but can enqueue the op with a
>> couple of empty mbufs marked as not ready for consumption. The enqueuer
>> will then update the rest of the mbufs to ready for consumption once the
>> data is added. This introduces a race condition. A second flag for each
>> mbuf can be updated by the PMD to indicate whether it processed it or not.
>> This way, in cases where the PMD beat the application to the op, the
>> application will just update the op to point to the first unprocessed
>> mbuf and resend it to the PMD.
>[Fiona] This doesn't sound safe. You want to add data to a stream after you've
>enqueued the op. You would have to write to op.src.length at a time when the PMD
>might be reading it. Sounds like a lock would be necessary.
>Once the op has been enqueued, my understanding is its ownership is handed
>over to the PMD and the application should not touch it until it has been dequeued.
>I don't think it's a good idea to change this model.
>Can't the application just collect a stream of data in chained mbufs until it has
>enough to send an op, then construct the op and while waiting for that op to
>complete, accumulate the next batch of chained mbufs? Only construct the next op
>after the previous one is complete, based on the result of the previous one.
>
>
>> >>>> [Ahmed] Related to OUT_OF_SPACE: what status does the user receive in a
>> >>>> decompression case when the end block is encountered before the end of
>> >>>> the input? Does the PMD continue decompression? Does it stop there and
>> >>>> return the stop index?
>> >>>>
>> >>> [Shally] Before I can answer this, please help me understand your use
>> >>> case. When you say "when the end block is encountered before the end of
>> >>> the input", do you mean "the decompressor processes a final block (i.e. one
>> >>> with BFINAL=1 in its header) and there's some footer data after that?" Or
>> >>> do you mean "the decompressor processes one block and has more to process
>> >>> until its final block?"
>> >>> What do "end block" and "end of input" refer to here?
>> [Ahmed] By end block I meant BFINAL=1. The end of input is the end of
>> the input length, e.g.
>> | input length------------------------------------------------------|
>> |--data----data----data------data-------BFINAL-footer-unrelated data|
>> >>>
>[Fiona] I propose if BFINAL bit is detected before end of input
>the decompression should stop. In this case consumed will be < src.length.
>produced will be < dst buffer size. Do we need an extra STATUS response?
>STATUS_BFINAL_DETECTED  ?
[Shally] @fiona, I assume you mean here that the decompressor stops after processing the final block, right? And if so, whether it processes that final block successfully or unsuccessfully, the status could simply be SUCCESS/FAILED.
I don't see the need for a specific return code for this use case. Just to share, in the past we have run into such cases in practice with the boost lib, and the decompressor has simply worked this way.
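A minimal sketch of what this stop-at-BFINAL rule could mean for the application, assuming a simplified op with hypothetical names (fin_op, FIN_SUCCESS and trailing_bytes() are illustrative, not the DPDK API). It addresses Fiona's performance concern: the fast path still tests only for SUCCESS, and an application that cares about data after the final block compares consumed with the input length as a separate, optional step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a completed decompression op.
 * These names are illustrative only; they are not the DPDK API. */
enum fin_status { FIN_SUCCESS = 0, FIN_FAILED };

struct fin_op {
	enum fin_status status;
	size_t src_length; /* bytes offered to the PMD */
	size_t consumed;   /* bytes the PMD actually read */
	size_t produced;   /* bytes written to the dst buffer */
};

/* Returns how many input bytes follow the final (BFINAL=1) block, or 0 if
 * the op failed. If the PMD stops right after the final block, SUCCESS with
 * consumed < src_length means trailing data (e.g. a footer) was left unread. */
static size_t trailing_bytes(const struct fin_op *op)
{
	assert(op != NULL);
	if (op->status != FIN_SUCCESS) /* only SUCCESS is checked on the fast path */
		return 0;
	assert(op->consumed <= op->src_length);
	return op->src_length - op->consumed;
}
```

An application that does not care about trailing data never needs to call trailing_bytes(); no extra status code is involved.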

>The only thing I don't like about this is that it can impact performance, as normally
>we can just look for STATUS == SUCCESS. Anything else should be an exception.
>Now the application would have to check for SUCCESS || BFINAL_DETECTED every time.
>Do you have a suggestion on how we should handle this?
>


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-01  5:40             ` Verma, Shally
@ 2018-02-01 11:54               ` Trahe, Fiona
  2018-02-01 20:50                 ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Trahe, Fiona @ 2018-02-01 11:54 UTC (permalink / raw)
  To: Verma, Shally, Ahmed Mansour, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry, Trahe, Fiona


> >[Fiona] I propose if BFINAL bit is detected before end of input
> >the decompression should stop. In this case consumed will be < src.length.
> >produced will be < dst buffer size. Do we need an extra STATUS response?
> >STATUS_BFINAL_DETECTED  ?
> [Shally] @fiona, I assume you mean here that the decompressor stops after processing the final block, right?
[Fiona] Yes.

> And if so, whether it processes that final block successfully or unsuccessfully, the status could simply be
> SUCCESS/FAILED.
> I don't see the need for a specific return code for this use case. Just to share, in the past we have run into
> such cases in practice with the boost lib, and the decompressor has simply worked this way.
[Fiona] I'm ok with this.

> >The only thing I don't like about this is that it can impact performance, as normally
> >we can just look for STATUS == SUCCESS. Anything else should be an exception.
> >Now the application would have to check for SUCCESS || BFINAL_DETECTED every time.
> >Do you have a suggestion on how we should handle this?
> >


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-01-31 19:03           ` Trahe, Fiona
  2018-02-01  5:40             ` Verma, Shally
@ 2018-02-01 20:23             ` Ahmed Mansour
  2018-02-14  7:41               ` Verma, Shally
  1 sibling, 1 reply; 30+ messages in thread
From: Ahmed Mansour @ 2018-02-01 20:23 UTC (permalink / raw)
  To: Trahe, Fiona, Verma, Shally, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

On 1/31/2018 2:03 PM, Trahe, Fiona wrote:
> Hi Ahmed, Shally,
>
> ///snip///
>>>>>>> D.1.1 Stateless and OUT_OF_SPACE
>>>>>>> ------------------------------------------------
>>>>>>> OUT_OF_SPACE is a condition where the output buffer runs out of space
>>>>>>> while the PMD still has more data to produce. If the PMD runs into such a
>>>>>>> condition, it is an error condition in stateless processing.
>>>>>>> In such a case, the PMD resets itself and returns with status
>>>>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0,
>>>>>>> i.e. no input read, no output written.
>>>>>>> The application can resubmit the full input with a larger output buffer.
>>>>>> [Ahmed] Can we add an option to allow the user to read the data that
>>>>>> was produced while still reporting OUT_OF_SPACE? This is mainly useful for
>>>>>> decompression applications doing search.
>>>>> [Shally] It is there, but applicable to the stateful operation type (please refer to
>>>>> handling out_of_space under the "Stateful" section).
>>>>> By definition, "stateless" here means that the application (such as IPCOMP)
>>>>> knows the maximum output size for certain and ensures that the uncompressed
>>>>> data size cannot grow beyond the provided output buffer.
>>>>> Such apps can submit an op with type = STATELESS and provide the full input;
>>>>> the PMD then assumes it has sufficient input and output and thus doesn't need
>>>>> to maintain any context after the op is processed.
>>>>> If the application doesn't know the max output size, then it should process it
>>>>> as a stateful op, i.e. set up the op with type = STATEFUL and attach a stream
>>>>> so that the PMD can maintain the relevant context to handle such a condition.
>>>> [Fiona] There may be an alternative that's useful for Ahmed, while still
>>>> respecting the stateless concept.
>>>> In the stateless case, where a PMD reports OUT_OF_SPACE in decompression,
>>>> it could also return consumed=0, produced=x, where x>0. x indicates the
>>>> amount of valid data which has been written to the output buffer. It is not
>>>> complete, but if an application wants to search it, it may be sufficient.
>>>> If the application still wants the data, it must resubmit the whole input with a
>>>> bigger output buffer, and decompression will be repeated from the start; it
>>>> cannot expect to continue on, as the PMD has not maintained state, history
>>>> or data.
>>>> I don't think there would be any need to indicate this in capabilities: PMDs
>>>> which cannot provide this functionality would always return
>>>> produced=consumed=0, while PMDs which can could set produced > 0.
>>>> If this works for you both, we could consider a similar case for compression.
>>>>
>>> [Shally] Sounds fine to me. Though in that case, consumed should also be updated to the actual
>>> number of bytes consumed by the PMD.
>>> Setting consumed = 0 with produced > 0 doesn't correlate.
>> [Ahmed] I like Fiona's suggestion, but I also do not like the implication
>> of returning consumed = 0. At the same time, returning consumed = y
>> implies to the user that it can proceed from the middle. I prefer the
>> consumed = 0 implementation, but I think a different return is needed to
>> distinguish it from an OUT_OF_SPACE that the user can recover from. Perhaps
>> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
>> future PMD implementations to provide recoverability even in STATELESS
>> mode if they so wish. In this model, STATELESS or STATEFUL would be a
>> hint for the PMD implementation to make optimizations for each case, but
>> it does not force the PMD implementation to limit functionality if it
>> can provide recoverability.
> [Fiona] So you're suggesting the following:
> OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
>     can be used and next op in stream should continue on from op.consumed+1.
> OUT_OF_SPACE_TERMINATED - returned only on stateless operation. 
>     Error condition, no recovery possible.
>     consumed=produced=0. Application must resubmit all input data with
>     a bigger output buffer.
> OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
>      - consumed = 0, produced > 0. Application must resubmit all input data with
>         a bigger output buffer. However in decompression case, data up to produced 
>         in dst buffer may be inspected/searched. Never happens in compression 
>         case as output data would be meaningless.
>      - consumed > 0, produced > 0. PMD has stored relevant state and history and so
>         can convert to stateful, using op.produced and continuing from consumed+1. 
> I don't expect our PMDs to use this last case, but maybe this works for others?
> I'm not convinced it's not just adding complexity. It sounds like a version of stateful 
> without a stream, and maybe less efficient?
> If so, should it respect the FLUSH flag, which would have been FULL or FINAL in the op?
> Or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
> simply have submitted a STATEFUL request if this is the behaviour it wants?
[Ahmed] I was actually suggesting removing OUT_OF_SPACE entirely
and replacing it with:
OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
    Error condition, no recovery possible.
    - consumed=0, produced=amount of data produced. The application must
      resubmit all input data with a bigger output buffer to process all of the op.
OUT_OF_SPACE_RECOVERABLE - normally returned on stateful operation. Not an error.
    op.produced can be used, and the next op in the stream should continue on
    from op.consumed+1.
    - consumed > 0, produced > 0. The PMD has stored the relevant state and
      history, and so can continue, using op.produced and continuing from consumed+1.

We would not return OUT_OF_SPACE_RECOVERABLE in stateless mode in our
implementation either.

Regardless of speculative future PMDs, the more important aspect of this
for today is that the return status clearly determines the meaning of
"consumed". If it is RECOVERABLE, then consumed is meaningful; if it is
TERMINATED, then consumed is meaningless.
This way we remove the ambiguity of having OUT_OF_SPACE mean two
different user workflows.

A speculative future PMD may be designed to return RECOVERABLE for
stateless ops that are attached to streams.
Such a PMD may check whether an op has a stream attached and, if so, write
out the state there and go into recoverable mode.
In essence, this leaves the choice up to the implementation and allows
the PMD to take advantage of stateless optimizations so long as a
"RECOVERABLE" scenario is rarely hit. The PMD will dump context as soon
as it fully processes an op; it will only write context out in cases
where the op chokes.
This futuristic PMD should ignore the FLUSH flag, since this is STATELESS
mode as indicated by the user, and optimize accordingly.
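A minimal sketch of how an application might dispatch on the two statuses proposed here. The OOS_* and ACT_* names, handle_status() and the application-tracked src_offset are hypothetical and are not the DPDK API. The point it illustrates is that consumed is acted on only when the status says it is meaningful. Offsets are 0-based, so "continue from consumed+1" (1-based) corresponds to index src_offset + consumed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes following the proposal above; not DPDK API. */
enum oos_status {
	OOS_SUCCESS = 0,
	OOS_OUT_OF_SPACE_TERMINATED,  /* consumed is meaningless: restart op */
	OOS_OUT_OF_SPACE_RECOVERABLE, /* consumed is valid: continue from it */
	OOS_ERROR
};

enum oos_action { ACT_DONE, ACT_RESUBMIT_BIGGER_DST, ACT_CONTINUE, ACT_GIVE_UP };

/* Decides the next step for a completed op. *src_offset is the 0-based
 * offset of the op's first input byte; it is advanced only when consumed
 * is meaningful, so on TERMINATED the same op is resubmitted from the start. */
static enum oos_action handle_status(enum oos_status st, size_t consumed,
				     size_t *src_offset)
{
	assert(src_offset != NULL);
	switch (st) {
	case OOS_SUCCESS:
		*src_offset += consumed;
		return ACT_DONE;
	case OOS_OUT_OF_SPACE_RECOVERABLE:
		/* produced bytes are usable; resume at the first unconsumed byte */
		*src_offset += consumed;
		return ACT_CONTINUE;
	case OOS_OUT_OF_SPACE_TERMINATED:
		/* produced bytes may be inspected (e.g. searched), but input restarts */
		return ACT_RESUBMIT_BIGGER_DST;
	default:
		return ACT_GIVE_UP;
	}
}
```

Because each status maps to exactly one workflow, the caller never has to guess whether consumed can be trusted.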
>>>>>>> D.2 Compression API Stateful operation
>>>>>>> ----------------------------------------------------------
>>>>>>> A Stateful operation in DPDK compression means the application invokes
>>>>>>> enqueue_burst() multiple times to process related chunks of data, either
>>>>>>> because
>>>>>>> - the application broke the data into several ops, and/or
>>>>>>> - the PMD ran into an out_of_space situation during input processing.
>>>>>>>
>>>>>>> In either or both of the above cases, the PMD is required to
>>>>>>> maintain the state of the op across enqueue_burst() calls, and
>>>>>>> ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with
>>>>>>> flush value RTE_COMP_NO/SYNC_FLUSH and ending with flush value
>>>>>>> RTE_COMP_FULL/FINAL_FLUSH.
>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>> ---------------------------------------------------------------
>>>>>>> Ideally, the application should parse through all related chunks of
>>>>>>> source data, building an mbuf chain, and enqueue it for stateless processing.
>>>>>>> However, if it needs to break the data into several enqueue_burst() calls, then
>>>>>>> the expected call flow would be something like:
>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
>>>>>> burst in a loop until all ops are received. Is this correct?
>>>>>>
>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However,
>>>>> this illustration is specifically in the context of stateful op processing:
>>>>> if a stream is broken into chunks, then each chunk should be submitted as
>>>>> one op at a time with type = STATEFUL, and it needs to be dequeued before
>>>>> the next chunk is enqueued.
>>>>>
>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just
>>>>>> distinguish the response in exception cases?
>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that
>>>>> such ops are independent of each other, i.e. they belong to different
>>>>> streams altogether.
>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process
>>>>> all related chunks of data in a single burst by passing them as an ops array,
>>>>> but later found that not so useful for PMD handling, for various
>>>>> reasons. Please refer to the RFC v1 doc review comments for the same.
>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
>>>> time, since each needs the state of the previous one, allowing more than one
>>>> op to be in flight at a time would force PMDs to implement internal queueing
>>>> and exception handling for the OUT_OF_SPACE conditions you mention.
>> [Ahmed] But we are putting the ops on qps, which would make them
>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>> complex, but doable.
> [Fiona] In my opinion this is not doable; it could be very inefficient.
> There may be many streams.
> The PMD would have to have an internal queue per stream so
> it could adjust the next src offset and length in the OUT_OF_SPACE case.
> And this may ripple back through all subsequent ops in the stream as each
> source len is increased and its dst buffer is not big enough.
[Ahmed] Regarding multi-op OUT_OF_SPACE handling:
The caller would still need to adjust the src length/output buffer as you
say. The PMD cannot handle OUT_OF_SPACE internally.
After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
until it gets explicit confirmation from the caller to continue working on
this stream. Any ops received by the PMD should be returned to the caller
with status STREAM_PAUSED, since the caller did not explicitly acknowledge
that it has solved the OUT_OF_SPACE issue.
These semantics can be enabled by adding a new function to the API,
perhaps stream_resume().
This allows the caller to indicate that it has seen the issue and that
this op should be used to resolve it. Implementations that do not support
this mode of use can push back as soon as one op is in flight.
Implementations that support this mode of use can allow many ops from the
same session.

Regarding the ordering of ops:
We do force serialization of ops belonging to a stream in STATEFUL
operation. Related ops do not go out of order and are given to available
PMDs one at a time.
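A minimal sketch of the pause/resume semantics proposed above, modelled as a single per-stream flag. STREAM_PAUSED and stream_resume() are only names suggested in this thread; pmd_process(), struct pause_stream, stream_resume_sketch() and the fits parameter are hypothetical simulation helpers, and nothing here is DPDK API. In the proposal the resume call would also identify the op that resolves the issue; this sketch reduces that to clearing the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the proposal; not DPDK API. */
enum pause_status { PS_SUCCESS, PS_OUT_OF_SPACE, PS_STREAM_PAUSED };

struct pause_stream {
	bool paused; /* set by the PMD when an op of this stream hits OUT_OF_SPACE */
};

/* PMD side: handle one op of the stream. 'fits' simulates whether the op's
 * output fits in its dst buffer. */
static enum pause_status pmd_process(struct pause_stream *s, bool fits)
{
	assert(s != NULL);
	if (s->paused)
		return PS_STREAM_PAUSED; /* reject until the caller acknowledges */
	if (!fits) {
		s->paused = true;
		return PS_OUT_OF_SPACE;
	}
	return PS_SUCCESS;
}

/* Caller side: acknowledge the OUT_OF_SPACE (after fixing src/dst) so that
 * the next op submitted on this stream is processed again. */
static void stream_resume_sketch(struct pause_stream *s)
{
	assert(s != NULL);
	s->paused = false;
}
```

With several ops of one stream in flight, every op after the one that hit OUT_OF_SPACE comes back as STREAM_PAUSED, so the PMD never has to re-slice later ops itself; the caller fixes the buffers, resumes, and resubmits.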

>> The question is whether this mode of use is useful for real-life
>> applications, or whether we would just be adding complexity. The technical
>> advantage of this is that processing of Stateful ops is interdependent,
>> and PMDs can take advantage of caching and other optimizations to make
>> processing related ops much faster than switching on every op. PMDs have to
>> maintain more than 32 KB of state for DEFLATE for every stream.
>>>> If the application has all the data, it can put it into chained mbufs in a single
>>>> op rather than multiple ops, which avoids pushing all that complexity down to the PMDs.
>> [Ahmed] I think that your suggested scheme of putting all related mbufs
>> into one op may be the best solution without the extra complexity of
>> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
>> time, if we have a way of marking mbufs as ready for consumption. The
>> enqueuer may not have all the data at hand but can enqueue the op with a
>> couple of empty mbufs marked as not ready for consumption. The enqueuer
>> will then update the rest of the mbufs to ready for consumption once the
>> data is added. This introduces a race condition. A second flag for each
>> mbuf can be updated by the PMD to indicate whether it processed it or not.
>> This way, in cases where the PMD beat the application to the op, the
>> application will just update the op to point to the first unprocessed
>> mbuf and resend it to the PMD.
> [Fiona] This doesn't sound safe. You want to add data to a stream after you've
> enqueued the op. You would have to write to op.src.length at a time when the PMD
> might be reading it. Sounds like a lock would be necessary.
> Once the op has been enqueued, my understanding is its ownership is handed
> over to the PMD and the application should not touch it until it has been dequeued.
> I don't think it's a good idea to change this model.
> Can't the application just collect a stream of data in chained mbufs until it has
> enough to send an op, then construct the op and while waiting for that op to
> complete, accumulate the next batch of chained mbufs? Only construct the next op
> after the previous one is complete, based on the result of the previous one.   
>
[Ahmed] Fair enough. I agree with you. I imagined it in a different way,
in which each mbuf would have its own length.
The advantage to gain is in applications where there is one PMD user:
the downtime between ops can be significant, and setting up a single
producer-consumer pair significantly reduces the CPU cycles and PMD
downtime.
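A minimal, synchronous sketch of the model Fiona describes above: the application accumulates incoming chunks (standing in for chained mbufs) and builds the next stateful op only after the previous one has been dequeued, so at most one op per stream is in flight. Following the D.2 text, earlier ops carry NO_FLUSH and the last carries FINAL_FLUSH. build_stateful_ops(), min_batch and the SK_* flush names are hypothetical; the real enqueue/dequeue calls are only indicated by a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flush values mirroring RTE_COMP_NO_FLUSH / RTE_COMP_FINAL_FLUSH
 * from the RFC text; these names are illustrative only. */
enum sk_flush { SK_NO_FLUSH, SK_FINAL_FLUSH };

/* Groups incoming chunks (as an app would chain mbufs) into stateful ops.
 * An op is built once at least min_batch bytes are pending, or when the last
 * chunk arrives. ops_len[k] and flush[k] describe op k; both arrays must hold
 * n entries. Returns the number of ops built. */
static size_t build_stateful_ops(const size_t *chunks, size_t n, size_t min_batch,
				 size_t *ops_len, enum sk_flush *flush)
{
	size_t pending = 0, nops = 0;

	assert(n == 0 || (chunks != NULL && ops_len != NULL && flush != NULL));
	for (size_t i = 0; i < n; i++) {
		pending += chunks[i]; /* append chunk to the pending mbuf chain */
		int last = (i == n - 1);
		if (pending >= min_batch || last) {
			ops_len[nops] = pending;
			flush[nops] = last ? SK_FINAL_FLUSH : SK_NO_FLUSH;
			nops++;
			/* here the app would enqueue this op and dequeue it before
			 * building the next one: one op in flight per stream */
			pending = 0;
		}
	}
	return nops;
}
```

In a real application, chunks arriving while an op is in flight would keep accumulating in the pending chain; the sketch collapses that waiting into the sequential loop.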

////snip////


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-01 11:54               ` Trahe, Fiona
@ 2018-02-01 20:50                 ` Ahmed Mansour
  2018-02-14  5:41                   ` Verma, Shally
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed Mansour @ 2018-02-01 20:50 UTC (permalink / raw)
  To: Trahe, Fiona, Verma, Shally, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

>>> [Fiona] I propose if BFINAL bit is detected before end of input
>>> the decompression should stop. In this case consumed will be < src.length.
>>> produced will be < dst buffer size. Do we need an extra STATUS response?
>>> STATUS_BFINAL_DETECTED  ?
>> [Shally] @fiona, I assume you mean here that the decompressor stops after processing the final block, right?
> [Fiona] Yes.
>
>> And if so, whether it processes that final block successfully or unsuccessfully, the status could simply be
>> SUCCESS/FAILED.
>> I don't see the need for a specific return code for this use case. Just to share, in the past we have run into
>> such cases in practice with the boost lib, and the decompressor has simply worked this way.
> [Fiona] I'm ok with this.
>
>>> The only thing I don't like about this is that it can impact performance, as normally
>>> we can just look for STATUS == SUCCESS. Anything else should be an exception.
>>> Now the application would have to check for SUCCESS || BFINAL_DETECTED every time.
>>> Do you have a suggestion on how we should handle this?
>>>
>
[Ahmed] This makes sense. So in all cases the PMD should assume that it
should stop as soon as a BFINAL is observed.

A question: what happens in stateful vs. stateless modes when
decompressing an op that encompasses multiple BFINALs? I assume the
caller in that case will use the consumed=x bytes to find out how far
into the input the end of the first stream is, and start from the next
byte. Is this correct?
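A minimal sketch of the caller-side loop described here, assuming each op stops right after the first BFINAL block and reports consumed accurately. fake_decompress() and struct fake_input are hypothetical stand-ins for enqueueing and dequeueing one op over the remaining input, simulated with a list of stream end offsets; they are not DPDK functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input holding several concatenated compressed streams.
 * stream_ends[i] is the 0-based offset just past stream i's BFINAL block. */
struct fake_input {
	size_t len;
	const size_t *stream_ends; /* ascending */
	size_t n_streams;
};

/* Stand-in for one op starting at 'offset': returns bytes consumed, stopping
 * right after the next final block, or 0 if no complete stream remains. */
static size_t fake_decompress(const struct fake_input *in, size_t offset)
{
	for (size_t i = 0; i < in->n_streams; i++)
		if (in->stream_ends[i] > offset)
			return in->stream_ends[i] - offset;
	return 0;
}

/* Walks the input by resubmitting from offset += consumed after each op.
 * Returns the number of complete streams found. */
static size_t count_streams(const struct fake_input *in)
{
	size_t offset = 0, streams = 0;

	assert(in != NULL);
	while (offset < in->len) {
		size_t consumed = fake_decompress(in, offset);
		if (consumed == 0)
			break; /* trailing bytes that are not a stream: stop */
		offset += consumed; /* next op starts at the first unconsumed byte */
		streams++;
	}
	return streams;
}
```

The consumed == 0 check keeps the loop from spinning forever on trailing bytes (such as a footer or unrelated data) that do not start a new stream.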



* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-01 20:50                 ` Ahmed Mansour
@ 2018-02-14  5:41                   ` Verma, Shally
  2018-02-14 16:54                     ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-02-14  5:41 UTC (permalink / raw)
  To: Ahmed Mansour, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

Hi Ahmed

>-----Original Message-----
>From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>Sent: 02 February 2018 02:20
>To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>Subject: Re: [RFC v2] doc compression API for DPDK
>
>>>> [Fiona] I propose if BFINAL bit is detected before end of input
>>>> the decompression should stop. In this case consumed will be < src.length.
>>>> produced will be < dst buffer size. Do we need an extra STATUS response?
>>>> STATUS_BFINAL_DETECTED  ?
>>> [Shally] @fiona, I assume you mean here that the decompressor stops after processing the final block, right?
>> [Fiona] Yes.
>>
>>> And if so, whether it processes that final block successfully or unsuccessfully, the status could simply be
>>> SUCCESS/FAILED.
>>> I don't see the need for a specific return code for this use case. Just to share, in the past we have run into
>>> such cases in practice with the boost lib, and the decompressor has simply worked this way.
>> [Fiona] I'm ok with this.
>>
>>>> The only thing I don't like about this is that it can impact performance, as normally
>>>> we can just look for STATUS == SUCCESS. Anything else should be an exception.
>>>> Now the application would have to check for SUCCESS || BFINAL_DETECTED every time.
>>>> Do you have a suggestion on how we should handle this?
>>>>
>>
>[Ahmed] This makes sense. So in all cases the PMD should assume that it
>should stop as soon as a BFINAL is observed.
>
>A question: what happens in stateful vs. stateless modes when
>decompressing an op that encompasses multiple BFINALs? I assume the
>caller in that case will use the consumed=x bytes to find out how far
>into the input the end of the first stream is, and start from the next
>byte. Is this correct?

[Shally] As per my understanding, each op can be tied to only one stream, as we have only one stream pointer per op, and one stream can have only one BFINAL (as a stream is one complete piece of compressed data). But it looks like you're suggesting a case where one op can carry multiple independent streams, and thus multiple BFINALs? For example, here is an op pointing to more than one stream:

       ---------------------------
op --> |stream1|stream2| |stream3|
       ---------------------------

Could you confirm whether I understand your question correctly?

Thanks
Shally


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-01 20:23             ` Ahmed Mansour
@ 2018-02-14  7:41               ` Verma, Shally
  2018-02-15 18:47                 ` Trahe, Fiona
  0 siblings, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-02-14  7:41 UTC (permalink / raw)
  To: Ahmed Mansour, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

Hi Ahmed,

>-----Original Message-----
>From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>Sent: 02 February 2018 01:53
>To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>Subject: Re: [RFC v2] doc compression API for DPDK
>
>On 1/31/2018 2:03 PM, Trahe, Fiona wrote:
>> Hi Ahmed, Shally,
>>
>> ///snip///
>>>>>>>> D.1.1 Stateless and OUT_OF_SPACE
>>>>>>>> ------------------------------------------------
>>>>>>>> OUT_OF_SPACE is a condition where the output buffer runs out of space
>>>>>>>> while the PMD still has more data to produce. If the PMD runs into such a
>>>>>>>> condition, it is an error condition in stateless processing.
>>>>>>>> In such a case, the PMD resets itself and returns with status
>>>>>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0,
>>>>>>>> i.e. no input read, no output written.
>>>>>>>> The application can resubmit the full input with a larger output buffer.
>>>>>>> [Ahmed] Can we add an option to allow the user to read the data that
>>>>>>> was produced while still reporting OUT_OF_SPACE? This is mainly useful for
>>>>>>> decompression applications doing search.
>>>>>> [Shally] It is there, but applicable to the stateful operation type (please refer to
>>>>>> handling out_of_space under the "Stateful Section").
>>>>>> By definition, "stateless" here means that the application (such as IPCOMP)
>>>>>> knows the maximum output size with certainty and ensures that the uncompressed
>>>>>> data size cannot grow beyond the provided output buffer.
>>>>>> Such apps can submit an op with type = STATELESS and provide the full input;
>>>>>> the PMD then assumes it has sufficient input and output and thus doesn't need
>>>>>> to maintain any context after the op is processed.
>>>>>> If the application doesn't know the max output size, then it should process it
>>>>>> as a stateful op, i.e. set up the op with type = STATEFUL and attach a stream so
>>>>>> that the PMD can maintain relevant context to handle such a condition.
>>>>> [Fiona] There may be an alternative that's useful for Ahmed, while still
>>>>> respecting the stateless concept.
>>>>> In the stateless case where a PMD reports OUT_OF_SPACE in the decompression
>>>>> case, it could also return consumed=0, produced=x, where x>0. x indicates the
>>>>> amount of valid data which has been written to the output buffer. It is not
>>>>> complete, but if an application wants to search it, it may be sufficient.
>>>>> If the application still wants the data, it must resubmit the whole input with a
>>>>> bigger output buffer, and decompression will be repeated from the start; it
>>>>> cannot expect to continue on, as the PMD has not maintained state, history
>>>>> or data.
>>>>> I don't think there would be any need to indicate this in capabilities: PMDs
>>>>> which cannot provide this functionality would always return produced=consumed=0,
>>>>> while PMDs which can could set produced > 0.
>>>>> If this works for you both, we could consider a similar case for compression.
>>>>>
>>>> [Shally] Sounds fine to me. Though in that case, consumed should also be updated
>>>> to the actual amount consumed by the PMD.
>>>> Setting consumed = 0 with produced > 0 doesn't correlate.
>>> [Ahmed] I like Fiona's suggestion, but I also do not like the implication
>>> of returning consumed = 0. At the same time, returning consumed = y
>>> implies to the user that it can proceed from the middle. I prefer the
>>> consumed = 0 implementation, but I think a different return is needed to
>>> distinguish it from an OUT_OF_SPACE that the user can recover from. Perhaps
>>> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
>>> future PMD implementations to provide recoverability even in STATELESS
>>> mode if they so wish. In this model, STATELESS or STATEFUL would be a
>>> hint for the PMD implementation to make optimizations for each case, but
>>> it does not force the PMD implementation to limit functionality if it
>>> can provide recoverability.
>> [Fiona] So you're suggesting the following:
>> OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
>>     can be used and next op in stream should continue on from op.consumed+1.
>> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
>>     Error condition, no recovery possible.
>>     consumed=produced=0. Application must resubmit all input data with
>>     a bigger output buffer.
>> OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
>>      - consumed = 0, produced > 0. Application must resubmit all input data with
>>         a bigger output buffer. However in decompression case, data up to produced
>>         in dst buffer may be inspected/searched. Never happens in compression
>>         case as output data would be meaningless.
>>      - consumed > 0, produced > 0. PMD has stored relevant state and history and so
>>         can convert to stateful, using op.produced and continuing from consumed+1.
>> I don't expect our PMDs to use this last case, but maybe this works for others?
>> I'm not convinced it's not just adding complexity. It sounds like a version of stateful
>> without a stream, and maybe less efficient?
>> If so, should it respect the FLUSH flag, which would have been FULL or FINAL in the op?
>> Or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
>> simply have submitted a STATEFUL request if this is the behaviour it wants?
>[Ahmed] I was actually suggesting the removal of OUT_OF_SPACE entirely
>and replacing it with:
>OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
>        Error condition, no recovery possible.
>        - consumed=0, produced=amount of data produced. Application must
>          resubmit all input data with a bigger output buffer to process all of the op.
>OUT_OF_SPACE_RECOVERABLE - normally returned on stateful operation. Not an error.
>    Op.produced can be used and the next op in the stream should continue on from op.consumed+1.
>        - consumed > 0, produced > 0. PMD has stored relevant state and history and so
>          can continue using op.produced and continuing from consumed+1.
>
>We would not return OUT_OF_SPACE_RECOVERABLE in stateless mode in our
>implementation either.
>
>Regardless of speculative future PMDs, the more important aspect of this
>for today is that the return status clearly determines
>the meaning of "consumed". If it is RECOVERABLE, then consumed is
>meaningful; if it is TERMINATED, then consumed is meaningless.
>This way we take away the ambiguity of having OUT_OF_SPACE mean two
>different user workflows.
>
>A speculative future PMD may be designed to return RECOVERABLE for
>stateless ops that are attached to streams.
>A future PMD may check whether an op has a stream attached, write
>out the state there and go into recoverable mode.
>In essence, this leaves the choice up to the implementation and allows
>the PMD to take advantage of stateless optimizations
>so long as a "RECOVERABLE" scenario is rarely hit. The PMD will dump
>context as soon as it fully processes an op. It will only
>write context out in cases where the op chokes.
>This futuristic PMD should ignore the FLUSH flag, since this is STATELESS mode as
>indicated by the user, and optimize

[Shally] IMO, it looks okay to have two separate return codes, TERMINATED and RECOVERABLE, with the definitions you mentioned, and it seems doable.
So then it means all of the following conditions:
a. stateless with flush = full/final, no stream pointer provided: PMD can return TERMINATED, i.e. the user has to start all over again; it's a failure (as in the current definition)
b. stateless with flush = full/final, stream pointer provided: here it's up to the PMD to return either TERMINATED or RECOVERABLE depending upon its ability (note: if RECOVERABLE, then the PMD will maintain state in the stream pointer)
c. stateful with flush = full / NO_SYNC, stream pointer always there: PMD will return TERMINATED/RECOVERABLE depending on whether the STATEFUL_COMPRESSION/DECOMPRESSION feature flag is enabled or not

And one more exception case is:
d. stateless with flush = full, no stream pointer provided: PMD can return RECOVERABLE, i.e. the PMD internally maintained that state somehow and consumed & produced > 0, so the user can start at consumed+1, but there's a restriction on the user not to alter or change the op until it is fully processed?!

The API currently takes care of cases a and c, and case b can be supported if the specification accepts another proposal which mentions optional usage of a stream with stateless. Until then, the API makes no distinction between cases b and c, i.e. we can have an op such as:
- type = stateful with flush = full/final, stream pointer provided: PMD can return TERMINATED/RECOVERABLE according to its ability

Case d is something exceptional; if there's a requirement in PMDs to support it, then I believe it will be doable with the concept of different return codes.
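
To make the proposed split concrete, here is a minimal C sketch of how an application might act on the two out-of-space codes discussed above. All names (comp_status, comp_op_result, next_action) are hypothetical stand-ins, not the rte_comp.h API. Offsets are 0-based, so "continue from consumed+1" in the 1-based wording of the thread becomes src_offset + consumed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes mirroring the names proposed in this thread;
 * they are not taken from a released rte_comp.h. */
enum comp_status {
	COMP_STATUS_SUCCESS,
	COMP_STATUS_OUT_OF_SPACE_TERMINATED,
	COMP_STATUS_OUT_OF_SPACE_RECOVERABLE,
	COMP_STATUS_ERROR
};

/* Hypothetical, minimal view of a dequeued op. */
struct comp_op_result {
	enum comp_status status;
	size_t src_offset;	/* where this op's input starts in the app's buffer */
	size_t src_len;		/* input bytes submitted with this op */
	size_t consumed;	/* input bytes the PMD reports as consumed */
	size_t produced;	/* valid output bytes written to dst */
};

enum app_action {
	ACTION_DONE,		/* op completed */
	ACTION_RESUBMIT_ALL,	/* resubmit the same input with a bigger dst */
	ACTION_CONTINUE,	/* next op in the stream starts after the consumed bytes */
	ACTION_FAIL
};

/* Decide the next step; *next_offset gets the input offset for the next op. */
static enum app_action
next_action(const struct comp_op_result *r, size_t *next_offset)
{
	assert(r != NULL && next_offset != NULL);

	switch (r->status) {
	case COMP_STATUS_SUCCESS:
		*next_offset = r->src_offset + r->consumed;
		return ACTION_DONE;
	case COMP_STATUS_OUT_OF_SPACE_TERMINATED:
		/* consumed carries no meaning; bytes up to 'produced' may only be inspected */
		*next_offset = r->src_offset;
		return ACTION_RESUBMIT_ALL;
	case COMP_STATUS_OUT_OF_SPACE_RECOVERABLE:
		/* consumed is meaningful: state was kept, so continue right after it */
		*next_offset = r->src_offset + r->consumed;
		return ACTION_CONTINUE;
	default:
		*next_offset = r->src_offset;
		return ACTION_FAIL;
	}
}
```

The point of keying off the status alone, rather than inferring meaning from consumed/produced, is that it removes the ambiguity Ahmed describes.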

>>>>>>>> D.2 Compression API Stateful operation
>>>>>>>> ----------------------------------------------------------
>>>>>>>> A stateful operation in DPDK compression means the application invokes
>>>>>>>> enqueue_burst() multiple times to process a related chunk of data, either
>>>>>>>> because:
>>>>>>>> - the application broke the data into several ops, and/or
>>>>>>>> - the PMD ran into an out_of_space situation during input processing.
>>>>>>>>
>>>>>>>> In case of either or both of the above conditions, the PMD is required to
>>>>>>>> maintain the state of the op across enqueue_burst() calls, and
>>>>>>>> ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with
>>>>>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and ending with flush value
>>>>>>>> RTE_COMP_FULL/FINAL_FLUSH.
>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>> ---------------------------------------------------------------
>>>>>>>> It is always the ideal expectation that the application parses
>>>>>>>> through all related chunks of source data, makes an mbuf chain of them and
>>>>>>>> enqueues it for stateless processing.
>>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then
>>>>>>>> the expected call flow would be something like:
>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue_burst
>>>>>>> in a loop until all ops are received. Is this correct?
>>>>>>>
>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this
>>>>>> illustration is specifically in the context of stateful op processing, to show that
>>>>>> if a stream is broken into chunks, then each chunk should be submitted as one op
>>>>>> at a time with type = STATEFUL and needs to be dequeued before the next chunk is
>>>>>> enqueued.
>>>>>>
>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just
>>>>>>> distinguish the response in exception cases?
>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in
>>>>>> such a case is independent of the others, i.e. they belong to different streams altogether.
>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all
>>>>>> related chunks of data in a single burst by passing them as an ops array, but later
>>>>>> found that not so useful for PMD handling, for various reasons.
>>>>>> Please refer to the RFC v1 doc review comments for the same.
>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
>>>>> time, since each needs the state of the previous, allowing more than one op to be
>>>>> in flight at a time would force PMDs to implement internal queueing and exception
>>>>> handling for the OUT_OF_SPACE conditions you mention.
>>> [Ahmed] But we are putting the ops on qps which would make them
>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>> complex but doable.
>> [Fiona] In my opinion this is not doable, or could be very inefficient.
>> There may be many streams.
>> The PMD would have to have an internal queue per stream so
>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>> And this may ripple back through all subsequent ops in the stream as each
>> source len is increased and its dst buffer is not big enough.
>[Ahmed] Regarding multi-op OUT_OF_SPACE handling:
>The caller would still need to adjust the src length/output buffer as you say.
>The PMD cannot handle OUT_OF_SPACE internally.
>After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
>until it gets explicit confirmation from the caller to continue working on
>this stream. Any ops received by the PMD should be returned to the caller
>with status STREAM_PAUSED, since the caller did not explicitly acknowledge
>that it has solved the OUT_OF_SPACE issue.
>These semantics can be enabled by adding a new function to the API,
>perhaps stream_resume().
>This allows the caller to indicate that it acknowledges that it has seen
>the issue and that this op should be used to resolve the issue.
>Implementations that do not support this mode of use can push back
>immediately after one op is in flight. Implementations that support this
>use mode can allow many ops from the same session
>
[Shally] Is this still in the context of having a single burst where all ops belong to one stream? If yes, I would still say it would add overhead to PMDs, especially if they are expected to work close to HW (which I think is the case with DPDK PMDs).
Though your approach is doable, why can't all of this be in a layer above the PMD? i.e. a layer above the PMD can either pass one op at a time with burst size = 1, OR make a chained mbuf of the input and output and pass that as one op.
Is it just to ease the chained-mbuf burden on applications, or do you also see a performance/use-case impact?

If it is in the context where each op in a burst belongs to a different stream, then why do we need stream_pause and resume? The app is expected to pass a larger output buffer, starting from consumed + 1, from the next call onwards, as it has already seen OUT_OF_SPACE.
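
One way the "layer above the PMD" option could look, as a minimal C sketch: per-stream bookkeeping that keeps at most one op in flight per stream and starts the next op right after the bytes the PMD reported as consumed, whether the previous op succeeded or hit a recoverable out-of-space. stream_ctx, stream_next_op() and stream_op_done() are hypothetical names, and the actual enqueue/dequeue calls are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-stream bookkeeping kept by a layer above the PMD. */
struct stream_ctx {
	size_t total_len;	/* input bytes the app has for this stream so far */
	size_t next_offset;	/* first byte not yet consumed by the PMD */
	bool in_flight;		/* at most one op per stream is outstanding */
};

/* Returns true and fills *off/*len when a new op may be enqueued for this stream. */
static bool
stream_next_op(struct stream_ctx *s, size_t max_chunk, size_t *off, size_t *len)
{
	assert(s != NULL && off != NULL && len != NULL);
	if (s->in_flight || s->next_offset >= s->total_len)
		return false;	/* wait for the previous op, or nothing left to send */
	*off = s->next_offset;
	*len = s->total_len - s->next_offset;
	if (*len > max_chunk)
		*len = max_chunk;
	s->in_flight = true;
	return true;
}

/* Called once the op is dequeued with SUCCESS or a recoverable out-of-space;
 * 'consumed' is what the PMD reported. The next op starts after those bytes. */
static void
stream_op_done(struct stream_ctx *s, size_t consumed)
{
	assert(s != NULL && s->in_flight);
	s->next_offset += consumed;
	s->in_flight = false;
}
```

Because serialization happens per stream, ops from different streams can still share a burst, which matches the "each op belongs to a different stream" case above.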

>Regarding the ordering of ops
>We do force serialization of ops belonging to a stream in STATEFUL
>operation. Related ops do
>not go out of order and are given to available PMDs one at a time.
>
>>> The question is: is this mode of use useful for real-life
>>> applications, or would we just be adding complexity? The technical
>>> advantage of this is that the processing of stateful ops is interdependent,
>>> and PMDs can take advantage of caching and other optimizations to make
>>> processing related ops much faster than switching on every op. PMDs have to
>>> maintain state of more than 32 KB for DEFLATE for every stream.
>>>>> If the application has all the data, it can put it into chained mbufs in a single
>>>>> op rather than
>>>>> multiple ops, which avoids pushing all that complexity down to the PMDs.
>>> [Ahmed] I think that your suggested scheme of putting all related mbufs
>>> into one op may be the best solution without the extra complexity of
>>> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
>>> time, if we have a way of marking mbufs as ready for consumption. The
>>> enqueuer may not have all the data at hand but can enqueue the op with a
>>> couple of empty mbufs marked as not ready for consumption. The enqueuer
>>> will then update the rest of the mbufs to ready for consumption once the
>>> data is added. This introduces a race condition. A second flag for each
>>> mbuf can be updated by the PMD to indicate whether it processed it or not.
>>> This way, in cases where the PMD beats the application to the op, the
>>> application will just update the op to point to the first unprocessed
>>> mbuf and resend it to the PMD.
>> [Fiona] This doesn't sound safe. You want to add data to a stream after you've
>> enqueued the op. You would have to write to op.src.length at a time when the PMD
>> might be reading it. Sounds like a lock would be necessary.
>> Once the op has been enqueued, my understanding is its ownership is handed
>> over to the PMD and the application should not touch it until it has been dequeued.
>> I don't think it's a good idea to change this model.
>> Can't the application just collect a stream of data in chained mbufs until it has
>> enough to send an op, then construct the op and while waiting for that op to
>> complete, accumulate the next batch of chained mbufs? Only construct the next op
>> after the previous one is complete, based on the result of the previous one.
>>
>[Ahmed] Fair enough. I agree with you. I imagined it in a different way
>in which each mbuf would have its own length.
>The advantage to gain is in applications where there is one PMD user:
>the downtime between ops can be significant, and setting up a single
>producer-consumer pair significantly reduces CPU cycles and PMD
>downtime.
>
>////snip////
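
Fiona's suggestion above (accumulate chained mbufs, build one op from the chain, and keep accumulating the next batch while that op is in flight) could look roughly like the following C sketch. struct seg is a hypothetical stand-in for a chained rte_mbuf segment, and the byte threshold that triggers an op is an assumed application policy, not something the API defines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a chained mbuf segment; real code would use struct rte_mbuf. */
struct seg {
	size_t len;
	struct seg *next;
};

/* Hypothetical accumulator: collects segments until at least 'threshold' bytes are
 * chained, so that one op can be built from the whole chain. */
struct batch {
	struct seg *head, *tail;
	size_t bytes;
	size_t threshold;
};

/* Append a segment; returns 1 when the batch is large enough to become an op. */
static int
batch_add(struct batch *b, struct seg *s)
{
	assert(b != NULL && s != NULL);
	s->next = NULL;
	if (b->tail != NULL)
		b->tail->next = s;
	else
		b->head = s;
	b->tail = s;
	b->bytes += s->len;
	return b->bytes >= b->threshold;
}

/* Detach the chain so it can be attached to an op; the batch restarts empty and
 * can keep accumulating while that op is in flight. */
static struct seg *
batch_take(struct batch *b, size_t *bytes)
{
	struct seg *chain;

	assert(b != NULL && bytes != NULL);
	chain = b->head;
	*bytes = b->bytes;
	b->head = b->tail = NULL;
	b->bytes = 0;
	return chain;
}
```

The op itself is only built after batch_take(), so the application never writes to an op that the PMD owns.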


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-14  5:41                   ` Verma, Shally
@ 2018-02-14 16:54                     ` Ahmed Mansour
  2018-02-15  5:53                       ` Verma, Shally
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed Mansour @ 2018-02-14 16:54 UTC (permalink / raw)
  To: Verma, Shally, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

On 2/14/2018 12:41 AM, Verma, Shally wrote:
> Hi Ahmed
>
>> -----Original Message-----
>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>> Sent: 02 February 2018 02:20
>> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>> Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>> Subject: Re: [RFC v2] doc compression API for DPDK
>>
>>>>> [Fiona] I propose if BFINAL bit is detected before end of input
>>>>> the decompression should stop. In this case consumed will be < src.length.
>>>>> produced will be < dst buffer size. Do we need an extra STATUS response?
>>>>> STATUS_BFINAL_DETECTED  ?
>>>> [Shally] @fiona, I assume you mean here that the decompressor stops after processing the final block, right?
>>> [Fiona] Yes.
>>>
>>>> If yes, and it processes that final block successfully or unsuccessfully, then the status could simply be
>>>> SUCCESS/FAILED.
>>>> I don't see the need for a specific return code for this use case. Just to share, in the past we have practically run into
>>>> such cases with the boost lib, and the decompressor has simply worked this way.
>>> [Fiona] I'm ok with this.
>>>
>>>>> The only thing I don't like about this is that it can impact performance, as normally
>>>>> we can just look for STATUS == SUCCESS. Anything else should be an exception.
>>>>> Now the application would have to check for SUCCESS || BFINAL_DETECTED every time.
>>>>> Do you have a suggestion on how we should handle this?
>>>>>
>> [Ahmed] This makes sense. So in all cases the PMD should assume that it
>> should stop as soon as a BFINAL is observed.
>>
>> A question: what happens in stateful vs stateless modes when
>> decompressing an op that encompasses multiple BFINALs? I assume the
>> caller in that case will use the consumed=x bytes to find out how far into
>> the input the end of the first stream is, and start from the next
>> byte. Is this correct?
> [Shally] As per my understanding, each op can be tied to only one stream, as we have only one stream pointer per op, and one stream can have only one BFINAL (as a stream is one complete compressed data set), but it looks like you're suggesting a case where one op can carry multiple independent streams, and thus multiple BFINALs? Such as below, where the op points to more than one stream:
>
>             --------------------------------------------
> op --> |stream1|stream2| |stream3|
>            --------------------------------------------
>
> Could you confirm if I understand your question correctly?
[Ahmed] Correct. We found that in some storage applications the user
does not know where exactly the BFINAL is. They rely on zlib software
today. zlib.net software halts at the first BFINAL. Users put multiple
streams in one op and rely on zlib to  stop and inform them of the end
location of the first stream.
>
> Thanks
> Shally
>
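
A minimal C sketch of the caller-side loop described here: decompress, use consumed to locate the end of the first stream, and start the next op from the following byte. Everything below is hypothetical. decompress_fn stands in for "build op, enqueue, dequeue", and toy_decompress() exists only so the sketch runs, using a one-byte length prefix in place of real DEFLATE data that ends in a BFINAL block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of one decompression op that stops at the first BFINAL block. */
struct dec_result {
	int ok;			/* nonzero on success */
	size_t consumed;	/* input bytes up to the end of the block with BFINAL set */
};

/* Hypothetical hook standing in for "build op, enqueue, dequeue". */
typedef struct dec_result (*decompress_fn)(const unsigned char *src, size_t len, void *user);

/* Walk a buffer holding several back-to-back compressed streams. Returns the number
 * of streams decoded, or -1 on error; ends[] receives each stream's end offset. */
static int
split_streams(const unsigned char *buf, size_t len, decompress_fn fn, void *user,
	      size_t *ends, int max_ends)
{
	size_t off = 0;
	int n = 0;

	assert(buf != NULL && fn != NULL && ends != NULL);
	while (off < len && n < max_ends) {
		struct dec_result r = fn(buf + off, len - off, user);

		if (!r.ok || r.consumed == 0 || r.consumed > len - off)
			return -1;	/* malformed input, or no progress */
		off += r.consumed;	/* next op starts right after this stream */
		ends[n++] = off;
	}
	return n;
}

/* Toy stand-in for the PMD so the sketch runs: each "stream" is a length byte
 * followed by that many payload bytes. */
static struct dec_result
toy_decompress(const unsigned char *src, size_t len, void *user)
{
	struct dec_result r = {0, 0};

	(void)user;
	if (len >= 1 && (size_t)src[0] + 1 <= len) {
		r.ok = 1;
		r.consumed = (size_t)src[0] + 1;
	}
	return r;
}
```

Any footer between streams (see the zlib discussion that follows) would have to be skipped by the caller before the next call.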



* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-14 16:54                     ` Ahmed Mansour
@ 2018-02-15  5:53                       ` Verma, Shally
  2018-02-15 17:20                         ` Trahe, Fiona
  0 siblings, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-02-15  5:53 UTC (permalink / raw)
  To: Ahmed Mansour, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry



>-----Original Message-----
>From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>Sent: 14 February 2018 22:25
>To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona <fiona.trahe@intel.com>; dev@dpdk.org
>Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>Subject: Re: [RFC v2] doc compression API for DPDK
>
>On 2/14/2018 12:41 AM, Verma, Shally wrote:
>> Hi Ahmed
>>
>> [Shally] As per my understanding, each op can be tied to only one stream, as we have only one stream pointer per op, and one
>> stream can have only one BFINAL (as a stream is one complete compressed data set), but it looks like you're suggesting a case where one op
>> can carry multiple independent streams, and thus multiple BFINALs? Such as below, where the op points to more than one stream:
>>
>>             --------------------------------------------
>> op --> |stream1|stream2| |stream3|
>>            --------------------------------------------
>>
>> Could you confirm if I understand your question correctly?
>[Ahmed] Correct. We found that in some storage applications the user
>does not know where exactly the BFINAL is. They rely on zlib software
>today. zlib.net software halts at the first BFINAL. Users put multiple
>streams in one op and rely on zlib to  stop and inform them of the end
>location of the first stream.

[Shally] Then this case is practically possible on the decompressor, and the decompressor doesn't regard the flush flag. So in that case, I expect the PMD to internally reset itself (say, in the case of zlib, going through a cycle of inflateEnd and inflateInit, or inflateReset) and return with status = SUCCESS with updated produced and consumed. Now in such a case, if the previous stream also has some footer followed by the start of the next stream, then I am not sure how the PMD/lib can support that case. Have you practically run such a use case on zlib? If yes, how do such applications handle it in your experience?
I can imagine that for such input zlib would return Z_STREAM_END to the user after the 1st BFINAL is processed. The application would then do inflateReset() or an inflateEnd()/inflateInit() cycle before starting with the next. But if it starts with input that doesn't have a valid zlib header, then it will likely throw an error.

>>
>> Thanks
>> Shally
>>


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-15  5:53                       ` Verma, Shally
@ 2018-02-15 17:20                         ` Trahe, Fiona
  2018-02-15 19:51                           ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Trahe, Fiona @ 2018-02-15 17:20 UTC (permalink / raw)
  To: Verma, Shally, Ahmed Mansour, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry, Trahe, Fiona

Hi Ahmed, Shally,

> -----Original Message-----
> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
> Sent: Thursday, February 15, 2018 5:53 AM
> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
> dev@dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> Subject: RE: [RFC v2] doc compression API for DPDK
> 
> 
> 
> >-----Original Message-----
> >From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> >Sent: 14 February 2018 22:25
> >To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona <fiona.trahe@intel.com>; dev@dpdk.org
> >Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> <Ashish.Gupta@cavium.com>; Sahu, Sunila
> ><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa,
> Mahipal
> ><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal
> <hemant.agrawal@nxp.com>; Roy
> >Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> >Subject: Re: [RFC v2] doc compression API for DPDK
> >
> >On 2/14/2018 12:41 AM, Verma, Shally wrote:
> >> Hi Ahmed
> >>
> >> [Shally] As per my understanding, each op can be tied to only one stream, as we have only one
> >> stream pointer per op, and one stream can have only one BFINAL (as a stream is one complete
> >> compressed data set), but it looks like you're suggesting a case where one op can carry multiple
> >> independent streams, and thus multiple BFINALs? Such as below, where the op points to more than one stream:
> >>
> >>             --------------------------------------------
> >> op --> |stream1|stream2| |stream3|
> >>            --------------------------------------------
> >>
> >> Could you confirm if I understand your question correctly?
> >[Ahmed] Correct. We found that in some storage applications the user
> >does not know where exactly the BFINAL is. They rely on zlib software
> >today. zlib.net software halts at the first BFINAL. Users put multiple
> >streams in one op and rely on zlib to  stop and inform them of the end
> >location of the first stream.
> 
> [Shally] Then this case is practically possible on the decompressor, and the decompressor doesn't regard
> the flush flag. So in that case, I expect the PMD to internally reset itself (say, in the case of zlib, going
> through a cycle of inflateEnd and inflateInit, or inflateReset) and return with status = SUCCESS with
> updated produced and consumed. Now in such a case, if the previous stream also has some footer followed
> by the start of the next stream, then I am not sure how the PMD/lib can support that case. Have you
> practically run such a use case on zlib? If yes, how do such applications handle it in your experience?
> I can imagine that for such input zlib would return Z_STREAM_END to the user after the 1st BFINAL is
> processed. The application would then do inflateReset() or an inflateEnd()/inflateInit() cycle before
> starting with the next. But if it starts with input that doesn't have a valid zlib header, then it will likely throw an error.
> 
[Fiona] The consumed and produced values tell the application how much data was processed up to
the end of the first deflate block encountered with BFINAL set.
If there is data, e.g. footer after the block with bfinal, then I think it must be the responsibility of
the application to know this, the PMD can't have any responsibility for this.
The next op sent to the PMD must start with a valid deflate block.
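
A minimal sketch of the application-side loop this implies, assuming a toy block format rather than real DEFLATE: toy_decompress() is a hypothetical stand-in for a PMD that stops after the first block with its BFINAL bit set and reports consumed/produced. The application starts each following op at the byte right after the consumed data. Skipping any footer or metadata in between would remain the application's job and is not modelled here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes for this sketch only. */
enum toy_status { TOY_SUCCESS = 0, TOY_ERROR = 1 };

/*
 * Toy stand-in for a PMD processing one decompression op. NOT real DEFLATE:
 * each block is [hdr][len][len payload bytes], and bit 0 of hdr plays the
 * role of BFINAL. It stops after the first block with "BFINAL" set and
 * reports consumed/produced (only valid when TOY_SUCCESS is returned).
 */
static enum toy_status toy_decompress(const uint8_t *src, size_t src_len,
                                      uint8_t *dst, size_t dst_len,
                                      size_t *consumed, size_t *produced)
{
    size_t in = 0, out = 0;
    for (;;) {
        if (src_len - in < 2)
            return TOY_ERROR;               /* truncated block header */
        uint8_t hdr = src[in];
        size_t len = src[in + 1];
        if (src_len - in - 2 < len || dst_len - out < len)
            return TOY_ERROR;               /* truncated payload or no room */
        memcpy(dst + out, src + in + 2, len);
        in += 2 + len;
        out += len;
        if (hdr & 1u)                       /* "BFINAL": end of this stream */
            break;
    }
    *consumed = in;
    *produced = out;
    return TOY_SUCCESS;
}

/*
 * Application side: the input holds several concatenated streams. Each new
 * op starts where the previous one stopped (offset += consumed), so every op
 * begins with a valid block. Returns the number of streams, or -1 on error.
 */
static int decode_all_streams(const uint8_t *src, size_t src_len,
                              uint8_t *dst, size_t dst_len, size_t *total_out)
{
    size_t off = 0, out = 0;
    int streams = 0;
    while (off < src_len) {
        size_t consumed, produced;
        if (toy_decompress(src + off, src_len - off, dst + out, dst_len - out,
                           &consumed, &produced) != TOY_SUCCESS)
            return -1;
        off += consumed;    /* next op starts right after this stream */
        out += produced;
        streams++;
    }
    *total_out = out;
    return streams;
}
```

The loop only relies on consumed being exact when the PMD stops at BFINAL, which is the behaviour agreed above.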


> >>
> >> Thanks
> >> Shally
> >>


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-14  7:41               ` Verma, Shally
@ 2018-02-15 18:47                 ` Trahe, Fiona
  2018-02-15 21:09                   ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Trahe, Fiona @ 2018-02-15 18:47 UTC (permalink / raw)
  To: Verma, Shally, Ahmed Mansour, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry, Trahe, Fiona

Hi Shally, Ahmed, 
Sorry for the delay in replying,
Comments below

> -----Original Message-----
> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
> Sent: Wednesday, February 14, 2018 7:41 AM
> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
> dev@dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> Subject: RE: [RFC v2] doc compression API for DPDK
> 
> Hi Ahmed,
> 
> >-----Original Message-----
> >From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> >Sent: 02 February 2018 01:53
> >To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
> >Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
> ><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
> ><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
> >Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> >Subject: Re: [RFC v2] doc compression API for DPDK
> >
> >On 1/31/2018 2:03 PM, Trahe, Fiona wrote:
> >> Hi Ahmed, Shally,
> >>
> >> ///snip///
> >>>>>>>> D.1.1 Stateless and OUT_OF_SPACE
> >>>>>>>> ------------------------------------------------
> >>>>>>>> OUT_OF_SPACE is a condition where the output buffer runs out of space while the PMD still has more
> >>>>>>>> data to produce. If the PMD runs into such a condition, it is an error condition in stateless processing.
> >>>>>>>> In such a case, the PMD resets itself and returns with status RTE_COMP_OP_STATUS_OUT_OF_SPACE
> >>>>>>>> with produced=consumed=0, i.e. no input read, no output written.
> >>>>>>>> The application can resubmit the full input with a larger output buffer.
> >>>>>>> [Ahmed] Can we add an option to allow the user to read the data that was
> >>>>>>> produced while still reporting OUT_OF_SPACE? This is mainly useful for
> >>>>>>> decompression applications doing search.
> >>>>>> [Shally] It is there, but applicable only to the stateful operation type (please refer to handling
> >>>>>> out_of_space under the "Stateful" section).
> >>>>>> By definition, "stateless" here means that the application (such as IPCOMP) knows the maximum output size
> >>>>>> with certainty and ensures that the uncompressed data cannot grow beyond the provided output buffer.
> >>>>>> Such apps can submit an op with type = STATELESS and provide the full input; the PMD then assumes it has
> >>>>>> sufficient input and output and thus doesn't need to maintain any context after the op is processed.
> >>>>>> If the application doesn't know the maximum output size, then it should process it as a stateful op, i.e. set up
> >>>>>> the op with type = STATEFUL and attach a stream so that the PMD can maintain the relevant context to handle
> >>>>>> such a condition.
> >>>>> [Fiona] There may be an alternative that's useful for Ahmed, while still respecting the stateless concept.
> >>>>> In the stateless case, where a PMD reports OUT_OF_SPACE during decompression,
> >>>>> it could also return consumed=0, produced = x, where x>0. x indicates the amount of valid data which has
> >>>>> been written to the output buffer. It is not complete, but if an application wants to search it, it may be sufficient.
> >>>>> If the application still wants the data, it must resubmit the whole input with a bigger output buffer, and
> >>>>> decompression will be repeated from the start; it cannot expect to continue on, as the PMD has not maintained
> >>>>> state, history or data.
> >>>>> I don't think there would be any need to indicate this in capabilities: PMDs which cannot provide this
> >>>>> functionality would always return produced=consumed=0, while PMDs which can could set produced > 0.
> >>>>> If this works for you both, we could consider a similar case for compression.
> >>>>>
> >>>> [Shally] Sounds fine to me. Though in that case, consumed should also be updated to the amount actually
> >>>> consumed by the PMD.
> >>>> Setting consumed = 0 with produced > 0 is inconsistent.
> >>> [Ahmed] I like Fiona's suggestion, but I also do not like the implication
> >>> of returning consumed = 0. At the same time, returning consumed = y
> >>> implies to the user that it can proceed from the middle. I prefer the
> >>> consumed = 0 implementation, but I think a different return code is needed to
> >>> distinguish it from an OUT_OF_SPACE that the user can recover from. Perhaps
> >>> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
> >>> future PMD implementations to provide recoverability even in STATELESS
> >>> mode if they so wish. In this model, STATELESS or STATEFUL would be a
> >>> hint for the PMD implementation to make optimizations for each case, but
> >>> it does not force the PMD implementation to limit functionality if it
> >>> can provide recoverability.
> >> [Fiona] So you're suggesting the following:
> >> OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
> >>     can be used and next op in stream should continue on from op.consumed+1.
> >> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
> >>     Error condition, no recovery possible.
> >>     consumed=produced=0. Application must resubmit all input data with
> >>     a bigger output buffer.
> >> OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
> >>      - consumed = 0, produced > 0. Application must resubmit all input data with
> >>         a bigger output buffer. However in decompression case, data up to produced
> >>         in dst buffer may be inspected/searched. Never happens in compression
> >>         case as output data would be meaningless.
> >>      - consumed > 0, produced > 0. PMD has stored relevant state and history and so
> >>         can convert to stateful, using op.produced and continuing from consumed+1.
> >> I don't expect our PMDs to use this last case, but maybe this works for others?
> >> I'm not convinced it's not just adding complexity. It sounds like a version of stateful
> >> without a stream, and maybe less efficient?
> >> If so, should it respect the FLUSH flag, which would have been FULL or FINAL in the op?
> >> Or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
> >> simply have submitted a STATEFUL request if this is the behaviour it wants?
> >[Ahmed] I was actually suggesting the removal of OUT_OF_SPACE entirely and replacing it with:
> >OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
> >        Error condition, no recovery possible.
> >        - consumed=0, produced=amount of data produced. Application must resubmit all input data with
> >          a bigger output buffer to process all of the op.
> >OUT_OF_SPACE_RECOVERABLE - Normally returned on stateful operation. Not an error. Op.produced
> >    can be used and next op in stream should continue on from op.consumed+1.
> >        - consumed > 0, produced > 0. PMD has stored relevant state and history and so
> >            can continue using op.produced and continuing from consumed+1.
> >
> >We would not return OUT_OF_SPACE_RECOVERABLE in stateless mode in our
> >implementation either.
> >
> >Regardless of speculative future PMDs, the more important aspect of this
> >for today is that the return status clearly determines
> >the meaning of "consumed". If it is RECOVERABLE, then consumed is
> >meaningful; if it is TERMINATED, then consumed is meaningless.
> >This way we take away the ambiguity of having OUT_OF_SPACE mean two
> >different user workflows.
> >
> >A speculative future PMD may be designed to return RECOVERABLE for
> >stateless ops that are attached to streams.
> >A future PMD may look to see if an op has a stream attached, write
> >out the state there and go into recoverable mode.
> >In essence, this leaves the choice up to the implementation and allows
> >the PMD to take advantage of stateless optimizations
> >so long as a "RECOVERABLE" scenario is rarely hit. The PMD will dump
> >context as soon as it fully processes an op. It will only
> >write context out in cases where the op chokes.
> >This futuristic PMD should ignore the FLUSH flag, since this is STATELESS mode as
> >indicated by the user, and optimize
> 
> [Shally] IMO, it looks okay to have two separate return codes, TERMINATED and RECOVERABLE, with the
> definitions you mentioned, and it seems doable.
> So then it means all of the following conditions:
> a. stateless with flush = full/final, no stream pointer provided: PMD can return TERMINATED, i.e. the user
> has to start all over again; it's a failure (as in the current definition)
> b. stateless with flush = full/final, stream pointer provided: here it's up to the PMD to return either
> TERMINATED or RECOVERABLE depending upon its ability (note: if RECOVERABLE, then the PMD will maintain
> state in the stream pointer)
> c. stateful with flush = full / NO_SYNC, stream pointer always there: PMD will return
> TERMINATED/RECOVERABLE depending on whether the STATEFUL_COMPRESSION/DECOMPRESSION feature flag is
> enabled or not
[Fiona] I don't think the flush flag is relevant - it could be out of space on any flush flag, and if out of space
the PMD should ignore the flush flag.
Is there a need for TERMINATED? I didn't think it would ever need to be returned in the stateful case.
Why the reference to the feature flag? If a PMD doesn't support a feature, I think it should fail the op - not with
out-of-space, but with unsupported or similar. Or it would fail on stream creation.

> 
> and one more exceptional case is:
> d. stateless with flush = full, no stream pointer provided: PMD can return RECOVERABLE, i.e. the PMD
> internally maintained that state somehow and consumed & produced > 0, so the user can continue from consumed+1,
> but there's a restriction on the user not to alter or change the op until it is fully processed?!
[Fiona] Why the need for this case? 
There's always a restriction on user not to alter or change op until it is fully processed.
If a PMD can do this - why doesn't it create a stream when that API is called - and then it's the same as b?

> 
> API currently takes care of cases a and c, and case b can be supported if the specification accepts another
> proposal which mentions optional usage of a stream with stateless.
[Fiona] API has this, but, as we agreed, calling create_stream() with an op_type
parameter (stateful/stateless) is not optional. The PMD can return NULL or provide a stream; if the latter, then that
stream must be attached to ops.
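
A minimal sketch of that create_stream() contract, using hypothetical names (toy_create_stream(), struct toy_op, pmd_can_recover_stateless) rather than any final DPDK API: the application always calls create_stream() with the op type, and whatever comes back, NULL or a stream, is what gets attached to the op.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types for this sketch; not the DPDK compressdev API. */
enum op_type { OP_STATELESS, OP_STATEFUL };

struct toy_stream {
    enum op_type type;      /* PMD-private state/history would live here */
};

struct toy_op {
    enum op_type type;
    struct toy_stream *stream;   /* NULL when the PMD returned no stream */
};

/* Toy PMD: stateful work always gets a stream; for stateless work a stream
 * is created only if this PMD can use it for OUT_OF_SPACE recovery (case b). */
static struct toy_stream *toy_create_stream(enum op_type type,
                                            int pmd_can_recover_stateless)
{
    if (type == OP_STATELESS && !pmd_can_recover_stateless)
        return NULL;                        /* nothing to keep between ops */
    struct toy_stream *s = malloc(sizeof *s);
    if (s != NULL)
        s->type = type;
    return s;
}

/* Application: always call create_stream(), then attach whatever came back.
 * Returns 0 on success, -1 if a stateful op could not get a stream. */
static int toy_setup_op(struct toy_op *op, enum op_type type,
                        int pmd_can_recover_stateless)
{
    struct toy_stream *s = toy_create_stream(type, pmd_can_recover_stateless);
    if (type == OP_STATEFUL && s == NULL)
        return -1;                          /* stateful needs a stream */
    op->type = type;
    op->stream = s;                         /* may be NULL for stateless */
    return 0;
}
```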

> Until then, the API makes no distinction between
> cases b and c, i.e. we can have an op such as:
> - type = stateful with flush = full/final, stream pointer provided: PMD can return
> TERMINATED/RECOVERABLE according to its ability
> 
> Case d is something exceptional; if there's a requirement for PMDs to support it, then I believe it will be
> doable with the concept of different return codes.
> 
[Fiona] That's not quite how I understood it. Can it be simpler, with only the following cases?
a. stateless with flush = full/final, no stream pointer provided: PMD can return TERMINATED, i.e. the user
    has to start all over again; it's a failure (as in the current definition).
    consumed = 0, produced = amount of data produced. This is usually 0, but in the decompression
    case a PMD may return > 0 and the application may find it useful to inspect that data.
b. stateless with flush = full/final, stream pointer provided: here it's up to the PMD to return either
    TERMINATED or RECOVERABLE depending upon its ability (note: if RECOVERABLE, then the PMD will maintain
    state in the stream pointer).
c. stateful with flush = any, stream pointer always there: PMD will return RECOVERABLE.
    op.produced can be used and the next op in the stream should continue on from op.consumed+1.
    consumed=0, produced=0 is an unusual but allowed case. I'm not sure if it could ever happen, but
    there is no need to change state to TERMINATED in this case. There may be useful state/history
    stored in the PMD, even though no output has been produced yet.
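
To make the three cases concrete, here is a hedged sketch of how an application might react to each status. The status names mirror the TERMINATED/RECOVERABLE proposal in this thread; all types and functions are hypothetical. Offsets are 0-based, so "continue on from op.consumed+1" in the thread's wording becomes start + consumed here.

```c
#include <stddef.h>

/* Hypothetical status names mirroring the proposal in this thread. */
enum toy_status {
    ST_SUCCESS,
    ST_OUT_OF_SPACE_TERMINATED,   /* consumed is meaningless: start over  */
    ST_OUT_OF_SPACE_RECOVERABLE,  /* state kept: continue after consumed  */
    ST_ERROR
};

enum next_action { ACT_DONE, ACT_RESUBMIT_ALL, ACT_CONTINUE, ACT_FAIL };

struct toy_result {
    enum toy_status status;
    size_t consumed;
    size_t produced;
};

/*
 * Decide what to do next and where the next op's source should start,
 * as a 0-based offset into the application's input. cur_src_off is where
 * the op that produced this result started.
 */
static enum next_action handle_result(const struct toy_result *r,
                                      size_t cur_src_off, size_t *next_src_off)
{
    switch (r->status) {
    case ST_SUCCESS:
        *next_src_off = cur_src_off + r->consumed;
        return ACT_DONE;
    case ST_OUT_OF_SPACE_TERMINATED:
        /* Case a (or b when the PMD cannot recover): resubmit all input with
         * a bigger dst. For decompression, the first r->produced bytes of
         * dst may still be inspected. */
        *next_src_off = cur_src_off;
        return ACT_RESUBMIT_ALL;
    case ST_OUT_OF_SPACE_RECOVERABLE:
        /* Case b or c: r->produced bytes are valid output. Continue with a
         * fresh dst; consumed == 0 && produced == 0 is allowed here. */
        *next_src_off = cur_src_off + r->consumed;
        return ACT_CONTINUE;
    default:
        *next_src_off = cur_src_off;
        return ACT_FAIL;
    }
}
```

Keying the meaning of consumed off the status alone, rather than off the op type, is what removes the ambiguity Ahmed describes.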

> >>>>>>>> D.2 Compression API Stateful operation
> >>>>>>>> ----------------------------------------------------------
> >>>>>>>>  A Stateful operation in DPDK compression means the application invokes
> >>>>>>>> enqueue_burst() multiple times to process related chunks of data, either because
> >>>>>>>> - the application broke the data into several ops, and/or
> >>>>>>>> - the PMD ran into an out_of_space situation during input processing
> >>>>>>>>
> >>>>>>>> In case of either one or all of the above conditions, the PMD is required to maintain the state of the op
> >>>>>>>> across enqueue_burst() calls, and ops are set up with op_type RTE_COMP_OP_STATEFUL, begin with
> >>>>>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and end at flush value RTE_COMP_FULL/FINAL_FLUSH.
> >>>>>>>> D.2.1 Stateful operation state maintenance
> >>>>>>>> ---------------------------------------------------------------
> >>>>>>>> Ideally, the application is expected to parse through all related chunks of source data, building
> >>>>>>>> its mbuf chain, and enqueue it for stateless processing.
> >>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then an
> >>>>>>>> expected call flow would be something like:
> >>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
> >>>>>>> burst in a loop until all ops are received. Is this correct?
> >>>>>>>
> >>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is specifically in
> >>>>>> the context of stateful op processing, to reflect that if a stream is broken into chunks, then each chunk should be
> >>>>>> submitted as one op at a time with type = STATEFUL and needs to be dequeued before the next chunk is
> >>>>>> enqueued.
> >>>>>>
> >>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>>>>> enqueue_burst( |op.full_flush |)
> >>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
> >>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish
> >>>>>>> the response in exception cases?
> >>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in such a case is independent
> >>>>>> of the others, i.e. they belong to different streams altogether.
> >>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks of data in a single
> >>>>>> burst by passing them as an ops array, but later found that not so useful for PMD handling, for various
> >>>>>> reasons. Please refer to the RFC v1 doc review comments for the same.
> >>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a time, since each needs the
> >>>>> state of the previous, allowing more than one op to be in flight at a time would
> >>>>> force PMDs to implement internal queueing and exception handling for the
> >>>>> OUT_OF_SPACE conditions you mention.
> >>> [Ahmed] But we are putting the ops on qps which would make them
> >>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
> >>> complex but doable.
> >> [Fiona] In my opinion this is not doable, or could be very inefficient.
> >> There may be many streams.
> >> The PMD would have to have an internal queue per stream so
> >> it could adjust the next src offset and length in the OUT_OF_SPACE case.
> >> And this may ripple back through all subsequent ops in the stream as each
> >> source len is increased and its dst buffer is not big enough.
> >[Ahmed] Regarding multi-op OUT_OF_SPACE handling:
> >The caller would still need to adjust the src length/output buffer as you say. The PMD cannot handle
> >OUT_OF_SPACE internally.
> >After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream until it gets explicit
> >confirmation from the caller to continue working on this stream. Any ops received by
> >the PMD should be returned to the caller with status STREAM_PAUSED, since the caller did not
> >explicitly acknowledge that it has solved the OUT_OF_SPACE issue.
> >These semantics can be enabled by adding a new function to the API, perhaps stream_resume().
> >This allows the caller to indicate that it acknowledges that it has seen the issue and that this op
> >should be used to resolve the issue. Implementations that do not support this mode of use
> >can push back immediately after one op is in flight. Implementations that support this use
> >mode can allow many ops from the same session.
> >
> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If yes, I would still
> say it would add an overhead to PMDs, especially if they are expected to work close to HW (which I think is
> the case with DPDK PMDs).
> Though your approach is doable, why can't this all be in a layer above the PMD? i.e. a layer above the PMD
> can either pass one op at a time with burst size = 1, OR can make a chained mbuf of input and output and
> pass that as one op.
> Is it just to relieve applications of the chained-mbuf burden, or do you see any performance / use-case
> impacting aspect also?
> 
> If it is in the context where each op belongs to a different stream in a burst, then why do we need
> stream_pause and resume? The app is expected to pass a larger output buffer, continuing from consumed + 1,
> from the next call onwards, as it has already seen OUT_OF_SPACE.
>
[Fiona] I still have concerns with this and would not want to support it in our PMD.
To make sure I understand: you want to send a burst of ops, with several from the same stream.
If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
subsequent ops in that stream.
Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
Or somehow drop them? How?
And all while still processing ops from other streams.
As we want to offload each op to hardware with as little CPU processing as possible, we
would not want to open up each op to see which stream it's attached to and
make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.

Maybe we could add a capability if this behaviour is important for you?
e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
Our PMD would set this to 0, and expect no more than one op from a stateful stream
to be in flight at any time.
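
If such a capability were added, an application could gate its enqueues per stream roughly as below. TOY_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS and struct toy_stream_tracker are hypothetical, named after the suggestion above; the sketch only encodes the rule "no more than one op from a stateful stream in flight" unless the PMD advertises the capability.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit, named after the suggestion in this thread. */
#define TOY_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS (1ULL << 0)

/* Per-stream bookkeeping kept by the application. */
struct toy_stream_tracker {
    unsigned int in_flight;   /* ops of this stream enqueued, not yet dequeued */
};

/* May another op of this stateful stream be enqueued right now? */
static bool toy_can_enqueue(uint64_t dev_features,
                            const struct toy_stream_tracker *t)
{
    if (dev_features & TOY_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS)
        return true;              /* PMD advertises it accepts several */
    return t->in_flight == 0;     /* otherwise at most one op per stream */
}

static void toy_on_enqueue(struct toy_stream_tracker *t)
{
    t->in_flight++;
}

static void toy_on_dequeue(struct toy_stream_tracker *t)
{
    if (t->in_flight > 0)
        t->in_flight--;
}
```

With the bit clear, as Fiona expects for her PMD, the application simply waits for each stateful op to be dequeued before enqueuing the next op of that stream.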

 
> >Regarding the ordering of ops:
> >We do force serialization of ops belonging to a stream in STATEFUL operation. Related ops do
> >not go out of order and are given to available PMDs one at a time.
> >
> >>> The question is: is this mode of use useful for real
> >>> life applications, or would we be just adding complexity? The technical
> >>> advantage of this is that processing of Stateful ops is interdependent
> >>> and PMDs can take advantage of caching and other optimizations to make
> >>> processing related ops much faster than switching on every op. PMDs have to
> >>> maintain more than 32 KB of state for DEFLATE for every stream.
> >>>>> If the application has all the data, it can put it into chained mbufs in a single op rather than
> >>>>> multiple ops, which avoids pushing all that complexity down to the PMDs.
> >>> [Ahmed] I think that your suggested scheme of putting all related mbufs
> >>> into one op may be the best solution without the extra complexity of
> >>> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
> >>> time, if we have a way of marking mbufs as ready for consumption. The
> >>> enqueuer may not have all the data at hand but can enqueue the op with a
> >>> couple of empty mbufs marked as not ready for consumption. The enqueuer
> >>> will then update the rest of the mbufs to ready for consumption once the
> >>> data is added. This introduces a race condition. A second flag for each
> >>> mbuf can be updated by the PMD to indicate whether it processed it or not.
> >>> This way, in cases where the PMD beat the application to the op, the
> >>> application will just update the op to point to the first unprocessed
> >>> mbuf and resend it to the PMD.
> >> [Fiona] This doesn't sound safe. You want to add data to a stream after you've
> >> enqueued the op. You would have to write to op.src.length at a time when the PMD
> >> might be reading it. Sounds like a lock would be necessary.
> >> Once the op has been enqueued, my understanding is its ownership is handed
> >> over to the PMD and the application should not touch it until it has been dequeued.
> >> I don't think it's a good idea to change this model.
> >> Can't the application just collect a stream of data in chained mbufs until it has
> >> enough to send an op, then construct the op and while waiting for that op to
> >> complete, accumulate the next batch of chained mbufs? Only construct the next op
> >> after the previous one is complete, based on the result of the previous one.
> >>
> >[Ahmed] Fair enough. I agree with you. I imagined it in a different way,
> >in which each mbuf would have its own length.
> >The advantage to gain is in applications where there is one PMD user:
> >the downtime between ops can be significant, and setting up a single
> >producer-consumer pair significantly reduces the CPU cycles and PMD
> >downtime.
> >
> >////snip////


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-15 17:20                         ` Trahe, Fiona
@ 2018-02-15 19:51                           ` Ahmed Mansour
  2018-02-16 11:11                             ` Trahe, Fiona
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed Mansour @ 2018-02-15 19:51 UTC (permalink / raw)
  To: Trahe, Fiona, Verma, Shally, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

/// snip ///
>>>>>
>>>>>>>> [Fiona] I propose that if the BFINAL bit is detected before the end of input,
>>>>>>>> decompression should stop. In this case consumed will be < src.length.
>>>>>>>> produced will be < dst buffer size. Do we need an extra STATUS response?
>>>>>>>> STATUS_BFINAL_DETECTED  ?
>>>>>>> [Shally] @fiona, I assume you mean here decompressor stop after processing Final block right?
>>>>>> [Fiona] Yes.
>>>>>>
>>>>>>> And if yes,
>>>>>>> and if it can process that final block successfully/unsuccessfully, then status could simply be
>>>>>>> SUCCESS/FAILED.
>>>>>>> I don't see the need for a specific return code for this use case. Just to share, in the past we have
>>>>>>> run into such cases in practice with the Boost lib, and the decompressor has simply worked this way.
>>>>>> [Fiona] I'm ok with this.
>>>>>>
>>>>>>>> The only thing I don't like about this is that it can impact performance, as normally
>>>>>>>> we can just look for STATUS == SUCCESS. Anything else should be an exception.
>>>>>>>> Now the application would have to check for SUCCESS || BFINAL_DETECTED every time.
>>>>>>>> Do you have a suggestion on how we should handle this?
>>>>>>>>
>>>>> [Ahmed] This makes sense. So in all cases the PMD should assume that it
>>>>> should stop as soon as a BFINAL is observed.
>>>>>
>>>>> A question: what happens in stateful vs stateless modes when
>>>>> decompressing an op that encompasses multiple BFINALs? I assume the
>>>>> caller in that case will use the consumed=x bytes to find out how far
>>>>> into the input the end of the first stream is, and start from the next
>>>>> byte. Is this correct?
>>>> [Shally] As per my understanding, each op can be tied up to only one stream, as we have only one stream pointer
>>>> per op, and one stream can have only one BFINAL (as a stream is one complete piece of compressed data), but it
>>>> looks like you're suggesting a case where one op can carry multiple independent streams, and thus multiple
>>>> BFINALs, such as below, where the op points to more than one stream:
>>>>             --------------------------------------------
>>>> op --> |stream1|stream2| |stream3|
>>>>            --------------------------------------------
>>>>
>>>> Could you confirm if I understand your question correctly?
>>> [Ahmed] Correct. We found that in some storage applications the user
>>> does not know where exactly the BFINAL is. They rely on zlib software
>>> today. zlib.net software halts at the first BFINAL. Users put multiple
>>> streams in one op and rely on zlib to stop and inform them of the end
>>> location of the first stream.
>> [Shally] Then this case is practically possible only on the decompressor, and the decompressor doesn't regard the flush
>> flag. So in that case, I expect the PMD to internally reset itself (say, in the case of zlib, going through a cycle
>> of inflateEnd() and inflateInit(), or inflateReset()) and return with status = SUCCESS with updated produced
>> and consumed. Now in such a case, if the previous stream also has some footer followed by the start of the next
>> stream, then I am not sure how the PMD / lib can support that case. Have you actually run such a
>> use-case on zlib? If yes, how does such an application handle it in your experience?
>> I can imagine that for such input zlib would return Z_STREAM_END after the 1st BFINAL is processed to the
>> user, and the application would then do inflateReset() or an End/Init cycle before starting with the next. But if it starts
>> with input that doesn't have a valid zlib header, then it will likely throw an error.
>>
> [Fiona] The consumed and produced fields tell the application how much data was processed up to
> the end of the first deflate block encountered with BFINAL set.
> If there is data, e.g. a footer, after the block with BFINAL, then I think it must be the responsibility of
> the application to know this; the PMD can't have any responsibility for this.
> The next op sent to the PMD must start with a valid deflate block.
[Ahmed] Agreed. This is exactly what I expected. In our case we support
gzip and zlib header/footer processing, but that does not fundamentally
change the setup. The user may have other metadata after the footer,
which the PMD is not responsible for. The PMD should stop processing
depending on the mode: in raw DEFLATE, it should stop immediately; in
other modes, it should stop after the footer. We also have a mode in our
PMD to simply continue decompression. In that case there cannot be a
header/footer between streams in raw DEFLATE. That mode could perhaps be enabled
at the session level in the future, with a session parameter at
setup time. We call it "member continue". In this mode the PMD plows
through as much of the op as possible. If it hits incorrectly set up data,
then it returns what it did decompress successfully, along with the error code
for decompressing the data afterwards.
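
A hedged sketch of the two behaviours, again using a toy block format instead of real DEFLATE and without any gzip/zlib header or footer handling. By default the hypothetical toy_decompress_mode() stops at the end of the first stream; with member_continue set, it keeps going and, on malformed data, reports what it decoded successfully together with an error.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum toy_status { TOY_SUCCESS = 0, TOY_ERROR = 1 };

/*
 * Toy block format, NOT real DEFLATE: [hdr][len][len payload bytes], where
 * bit 0 of hdr stands in for BFINAL. consumed/produced always describe the
 * data decoded successfully, even when TOY_ERROR is returned.
 */
static enum toy_status toy_decompress_mode(const uint8_t *src, size_t src_len,
                                           uint8_t *dst, size_t dst_len,
                                           int member_continue,
                                           size_t *consumed, size_t *produced)
{
    size_t in = 0, out = 0;
    enum toy_status st = TOY_SUCCESS;
    while (in < src_len) {
        if (src_len - in < 2) {            /* truncated block header */
            st = TOY_ERROR;
            break;
        }
        size_t len = src[in + 1];
        if (src_len - in - 2 < len || dst_len - out < len) {
            st = TOY_ERROR;                /* truncated payload or no room */
            break;
        }
        int bfinal = src[in] & 1;
        memcpy(dst + out, src + in + 2, len);
        in += 2 + len;
        out += len;
        if (bfinal && !member_continue)
            break;                         /* default: stop after first stream */
    }
    *consumed = in;
    *produced = out;
    return st;
}
```

In the default mode the caller then resumes from consumed, as discussed above; in "member continue" mode a single op may cover several streams, at the cost of only learning about bad data after the good part has been returned.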
>
>
>>>> Thanks
>>>> Shally
>>>>
>



* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-15 18:47                 ` Trahe, Fiona
@ 2018-02-15 21:09                   ` Ahmed Mansour
  2018-02-16  7:16                     ` Verma, Shally
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed Mansour @ 2018-02-15 21:09 UTC (permalink / raw)
  To: Trahe, Fiona, Verma, Shally, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

On 2/15/2018 1:47 PM, Trahe, Fiona wrote:
> Hi Shally, Ahmed, 
> Sorry for the delay in replying,
> Comments below
>
>> -----Original Message-----
>> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
>> Sent: Wednesday, February 14, 2018 7:41 AM
>> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
>> dev@dpdk.org
>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>> Subject: RE: [RFC v2] doc compression API for DPDK
>>
>> Hi Ahmed,
>>
>>> -----Original Message-----
>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>>> Sent: 02 February 2018 01:53
>>> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>> <Ashish.Gupta@cavium.com>; Sahu, Sunila
>>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa,
>> Mahipal
>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal
>> <hemant.agrawal@nxp.com>; Roy
>>> Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>
>>> On 1/31/2018 2:03 PM, Trahe, Fiona wrote:
>>>> Hi Ahmed, Shally,
>>>>
>>>> ///snip///
>>>>>>>>>> D.1.1 Stateless and OUT_OF_SPACE
>>>>>>>>>> ------------------------------------------------
>>>>>>>>>> OUT_OF_SPACE is a condition when output buffer runs out of space
>>>>>>> and
>>>>>>>>> where PMD still has more data to produce. If PMD run into such
>>>>>>> condition,
>>>>>>>>> then it's an error condition in stateless processing.
>>>>>>>>>> In such case, PMD resets itself and return with status
>>>>>>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0
>>>>>>> i.e.
>>>>>>>>> no input read, no output written.
>>>>>>>>>> Application can resubmit an full input with larger output buffer size.
>>>>>>>>> [Ahmed] Can we add an option to allow the user to read the data that
>>>>>>> was
>>>>>>>>> produced while still reporting OUT_OF_SPACE? this is mainly useful for
>>>>>>>>> decompression applications doing search.
>>>>>>>> [Shally] It is there but applicable for stateful operation type (please refer to
>>>>>>> handling out_of_space under
>>>>>>>> "Stateful Section").
>>>>>>>> By definition, "stateless" here means that application (such as IPCOMP)
>>>>>>> knows maximum output size
>>>>>>>> guaranteedly and ensure that uncompressed data size cannot grow more
>>>>>>> than provided output buffer.
>>>>>>>> Such apps can submit an op with type = STATELESS and provide full input,
>>>>>>> then PMD assume it has
>>>>>>>> sufficient input and output and thus doesn't need to maintain any contexts
>>>>>>> after op is processed.
>>>>>>>> If application doesn't know about max output size, then it should process it
>>>>>>> as stateful op i.e. setup op
>>>>>>>> with type = STATEFUL and attach a stream so that PMD can maintain
>>>>>>> relevant context to handle such
>>>>>>>> condition.
>>>>>>> [Fiona] There may be an alternative that's useful for Ahmed, while still
>>>>>>> respecting the stateless concept.
>>>>>>> In the stateless case where a PMD reports OUT_OF_SPACE in decompression,
>>>>>>> it could also return consumed=0, produced=x, where x>0. x indicates the
>>>>>>> amount of valid data which has been written to the output buffer. It is not
>>>>>>> complete, but if an application wants to search it, it may be sufficient.
>>>>>>> If the application still wants the data, it must resubmit the whole input with a
>>>>>>> bigger output buffer, and decompression will be repeated from the start; it
>>>>>>> cannot expect to continue on, as the PMD has not maintained state, history
>>>>>>> or data.
>>>>>>> I don't think there would be any need to indicate this in capabilities: PMDs
>>>>>>> which cannot provide this functionality would always return produced=consumed=0,
>>>>>>> while PMDs which can could set produced > 0.
>>>>>>> If this works for you both, we could consider a similar case for compression.
>>>>>>>
>>>>>> [Shally] Sounds fine to me. Though in that case, consumed should also be updated to the actual
>>>>>> amount consumed by the PMD.
>>>>>> Setting consumed = 0 with produced > 0 doesn't correlate.
>>>>> [Ahmed] I like Fiona's suggestion, but I also do not like the implication
>>>>> of returning consumed = 0. At the same time, returning consumed = y
>>>>> implies to the user that it can proceed from the middle. I prefer the
>>>>> consumed = 0 implementation, but I think a different return is needed to
>>>>> distinguish it from an OUT_OF_SPACE that the user can recover from. Perhaps
>>>>> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
>>>>> future PMD implementations to provide recoverability even in STATELESS
>>>>> mode if they so wish. In this model, STATELESS or STATEFUL would be a
>>>>> hint for the PMD implementation to make optimizations for each case, but
>>>>> it does not force the PMD implementation to limit functionality if it
>>>>> can provide recoverability.
>>>> [Fiona] So you're suggesting the following:
>>>> OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
>>>>     can be used and next op in stream should continue on from op.consumed+1.
>>>> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
>>>>     Error condition, no recovery possible.
>>>>     consumed=produced=0. Application must resubmit all input data with
>>>>     a bigger output buffer.
>>>> OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
>>>>      - consumed = 0, produced > 0. Application must resubmit all input data with
>>>>         a bigger output buffer. However in decompression case, data up to produced
>>>>         in dst buffer may be inspected/searched. Never happens in compression
>>>>         case as output data would be meaningless.
>>>>      - consumed > 0, produced > 0. PMD has stored relevant state and history and so
>>>>         can convert to stateful, using op.produced and continuing from consumed+1.
>>>> I don't expect our PMDs to use this last case, but maybe this works for others?
>>>> I'm not convinced it's not just adding complexity. It sounds like a version of stateful
>>>> without a stream, and maybe less efficient?
>>>> If so, should it respect the FLUSH flag, which would have been FULL or FINAL in the op?
>>>> Or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
>>>> simply have submitted a STATEFUL request if this is the behaviour it wants?
>>> [Ahmed] I was actually suggesting the removal of OUT_OF_SPACE entirely
>>> and replacing it with:
>>> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
>>>        Error condition, no recovery possible.
>>>        - consumed=0, produced=amount of data produced. Application must
>>>          resubmit all input data with a bigger output buffer to process all of the op.
>>> OUT_OF_SPACE_RECOVERABLE - normally returned on stateful operation. Not an error.
>>>    Op.produced can be used and next op in stream should continue on from op.consumed+1.
>>>        - consumed > 0, produced > 0. PMD has stored relevant state and history and so
>>>          can continue using op.produced and continuing from consumed+1.
>>>
>>> We would not return OUT_OF_SPACE_RECOVERABLE in stateless mode in our
>>> implementation either.
>>>
>>> Regardless of speculative future PMDs, the more important aspect of this
>>> for today is that the return status clearly determines
>>> the meaning of "consumed". If it is RECOVERABLE, then consumed is
>>> meaningful; if it is TERMINATED, then consumed is meaningless.
>>> This way we take away the ambiguity of having OUT_OF_SPACE mean two
>>> different user workflows.
>>>
>>> A speculative future PMD may be designed to return RECOVERABLE for
>>> stateless ops that are attached to streams.
>>> A future PMD may check whether an op has a stream attached, write
>>> out the state there and go into recoverable mode.
>>> In essence, this leaves the choice up to the implementation and allows
>>> the PMD to take advantage of stateless optimizations
>>> so long as a "RECOVERABLE" scenario is rarely hit. The PMD will dump
>>> context as soon as it fully processes an op. It will only
>>> write context out in cases where the op chokes.
>>> This futuristic PMD should ignore the FLUSH, since this is STATELESS mode as
>>> indicated by the user, and optimize.
>> [Shally] IMO, it looks okay to have two separate return codes, TERMINATED and RECOVERABLE, with
>> the definitions you mentioned, and it seems doable.
>> So then it means all of the following conditions:
>> a. stateless with flush = full/final, no stream pointer provided: PMD can return TERMINATED, i.e. the user
>> has to start all over again; it's a failure (as in the current definition)
>> b. stateless with flush = full/final, stream pointer provided: here it's up to the PMD to return either
>> TERMINATED or RECOVERABLE depending upon its ability (note: if RECOVERABLE, then the PMD will maintain
>> state in the stream pointer)
>> c. stateful with flush = full / NO_SYNC, stream pointer always there: PMD will return
>> TERMINATED/RECOVERABLE depending on whether the STATEFUL_COMPRESSION/DECOMPRESSION feature flag
>> is enabled or not
> [Fiona] I don't think the flush flag is relevant - it could be out of space on any flush flag, and if out of space
> the PMD should ignore the flush flag.
> Is there a need for TERMINATED? I didn't think it would ever need to be returned in the stateful case.
> Why the reference to the feature flag? If a PMD doesn't support a feature, I think it should fail the op - not with
> out-of-space, but unsupported or similar. Or it would fail on stream creation.
[Ahmed] Agreed with Fiona. The flush flag only matters on success. By
definition the PMD should return OUT_OF_SPACE_RECOVERABLE in stateful
mode when it runs out of space.
@Shally If the user did not provide a stream, then the PMD should
probably return TERMINATED every time. I am not sure we should make a
"really smart" PMD which returns RECOVERABLE even if no stream pointer
was given. In that case the PMD must give some ID back to the caller
that the caller can use to "recover" the op. I am not sure how it would
be implemented in the PMD, or when the PMD would decide to retire streams
belonging to dead ops that the caller decided not to "recover".
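A minimal C sketch of how an application might act on the two proposed out-of-space statuses, assuming the semantics described in this thread: on TERMINATED, consumed is ignored and the whole op is redone with a bigger output buffer (partial output being worth inspecting only when decompressing); on RECOVERABLE, produced output is kept and the next op resumes after the last consumed byte. The `sk_` enums, structs and `sk_after_out_of_space()` are hypothetical stand-ins, not the rte_comp API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes modelled on the two statuses proposed in this
 * thread; they are not taken from any released DPDK header. */
enum sk_status {
	SK_SUCCESS,
	SK_OUT_OF_SPACE_TERMINATED,  /* consumed carries no meaning */
	SK_OUT_OF_SPACE_RECOVERABLE  /* consumed is meaningful */
};

/* Simplified stand-in for the op fields that matter here. */
struct sk_op {
	enum sk_status status;
	size_t src_offset; /* input offset at which this op started */
	size_t consumed;   /* input bytes read by the PMD */
	size_t produced;   /* output bytes written by the PMD */
};

/* What the application does next after an out-of-space status. */
struct sk_next {
	size_t next_src_offset; /* where the next op starts reading input */
	size_t inspectable;     /* leading dst bytes the app may look at */
	bool restart;           /* true: redo the whole op with a bigger dst */
};

struct sk_next sk_after_out_of_space(const struct sk_op *op, bool decompress)
{
	struct sk_next n;

	assert(op->status == SK_OUT_OF_SPACE_TERMINATED ||
	       op->status == SK_OUT_OF_SPACE_RECOVERABLE);

	if (op->status == SK_OUT_OF_SPACE_TERMINATED) {
		/* consumed is ignored: resubmit all input from the start.
		 * Partial output is only worth inspecting when decompressing;
		 * partial compressed output is meaningless. */
		n.next_src_offset = op->src_offset;
		n.inspectable = decompress ? op->produced : 0;
		n.restart = true;
	} else {
		/* Keep what was produced; the next op in the stream starts right
		 * after the last consumed byte ("consumed+1" in the thread's
		 * 1-based wording, src_offset + consumed as a 0-based offset). */
		n.next_src_offset = op->src_offset + op->consumed;
		n.inspectable = op->produced;
		n.restart = false;
	}
	return n;
}
```

Keying the decision purely on the status, rather than on the values of consumed/produced, is what removes the ambiguity of a single OUT_OF_SPACE code meaning two different workflows.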
>
>> and one more exception case is:
>> d. stateless with flush = full, no stream pointer provided: PMD can return RECOVERABLE, i.e. the PMD
>> internally maintained that state somehow and consumed & produced > 0, so the user can start at consumed+1,
>> but there's a restriction on the user not to alter or change the op until it is fully processed?!
> [Fiona] Why the need for this case?
> There's always a restriction on the user not to alter or change an op until it is fully processed.
> If a PMD can do this, why doesn't it create a stream when that API is called, making it the same as b?
[Ahmed] Agreed. The user should not touch an op once enqueued until they
receive it in dequeue. We ignore the flush in stateless mode. We assume
it to be final every time.
>
>> API currently takes care of cases a and c, and case b can be supported if the specification accepts another
>> proposal which mentions optional usage of a stream with stateless ops.
> [Fiona] API has this, but as we agreed, it is not optional to call create_stream() with an op_type
> parameter (stateful/stateless). The PMD can return NULL or provide a stream; if the latter, then that
> stream must be attached to ops.
>
>> Until then, the API makes no distinction between
>> cases b and c, i.e. we can have an op such as:
>> - type = stateful with flush = full/final, stream pointer provided: PMD can return
>> TERMINATED/RECOVERABLE according to its ability
>>
>> Case d is something exceptional; if there's a requirement for PMDs to support it, then I believe it will be
>> doable with the concept of different return codes.
>>
> [Fiona] That's not quite how I understood it. Can it be simpler, with only the following cases?
> a. stateless with flush = full/final, no stream pointer provided: PMD can return TERMINATED, i.e. the user
>     has to start all over again; it's a failure (as in the current definition).
>     consumed = 0, produced = amount of data produced. This is usually 0, but in the decompression
>     case a PMD may return > 0 and the application may find it useful to inspect that data.
> b. stateless with flush = full/final, stream pointer provided: here it's up to the PMD to return either
>     TERMINATED or RECOVERABLE depending upon its ability (note: if RECOVERABLE, then the PMD will maintain
>     state in the stream pointer)
> c. stateful with flush = any, stream pointer always there: PMD will return RECOVERABLE.
>     op.produced can be used and the next op in the stream should continue on from op.consumed+1.
>     consumed=0, produced=0 is an unusual but allowed case. I'm not sure if it could ever happen, but
>     there is no need to change state to TERMINATED in this case. There may be useful state/history
>     stored in the PMD, even though no output has been produced yet.
[Ahmed] Agreed
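The simplified cases a to c above can be summarised as a small decision function. This is only a sketch of the rules as stated in this exchange; the enum names, `sk_expected_oos()` and its `pmd_keeps_stateless_state` parameter are hypothetical. The flush flag is deliberately not a parameter, since it was agreed to matter only on success.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical enums for illustration only. */
enum sk_op_type { SK_STATELESS, SK_STATEFUL };
enum sk_oos_status { SK_OOS_TERMINATED, SK_OOS_RECOVERABLE };

/* Out-of-space status a PMD would report under cases a-c.
 * pmd_keeps_stateless_state models "depending upon its ability" in case b. */
enum sk_oos_status sk_expected_oos(enum sk_op_type type, bool stream_attached,
				   bool pmd_keeps_stateless_state)
{
	if (type == SK_STATEFUL) {
		assert(stream_attached); /* case c: a stream is always attached */
		return SK_OOS_RECOVERABLE;
	}
	if (stream_attached && pmd_keeps_stateless_state)
		return SK_OOS_RECOVERABLE; /* case b: PMD saved state in the stream */
	return SK_OOS_TERMINATED;          /* case a, or case b without that ability */
}
```

Case d (stateless, no stream, yet RECOVERABLE) has no branch here, matching the agreement that a PMD able to do this should simply hand out a stream, which reduces it to case b.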
>
>>>>>>>>>> D.2 Compression API Stateful operation
>>>>>>>>>> ----------------------------------------------------------
>>>>>>>>>> A Stateful operation in DPDK compression means the application invokes
>>>>>>>>>> enqueue_burst() multiple times to process a related chunk of data, either
>>>>>>>>>> because
>>>>>>>>>> - the application broke the data into several ops, and/or
>>>>>>>>>> - the PMD ran into an out_of_space situation during input processing
>>>>>>>>>>
>>>>>>>>>> In case of either one or all of the above conditions, the PMD is required to
>>>>>>>>>> maintain the state of the op across enqueue_burst() calls, and
>>>>>>>>>> ops are set up with op_type RTE_COMP_OP_STATEFUL, begin with
>>>>>>>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and end at flush value
>>>>>>>>>> RTE_COMP_FULL/FINAL_FLUSH.
>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>> It is always an ideal expectation from the application that it should parse
>>>>>>>>>> through all related chunks of source data, making its mbuf-chain, and enqueue
>>>>>>>>>> it for stateless processing.
>>>>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then an
>>>>>>>>>> expected call flow would be something like:
>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
>>>>>>>>> burst in a loop until all ops are received. Is this correct?
>>>>>>>>>
>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is
>>>>>>>> specifically in the context of stateful op processing, to reflect that if a stream is broken
>>>>>>>> into chunks, then each chunk should be submitted as one op at a time with type = STATEFUL and
>>>>>>>> needs to be dequeued first before the next chunk is enqueued.
>>>>>>>>
>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
>>>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just
>>>>>>>>> distinguish the response in exception cases?
>>>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in such a
>>>>>>>> case is independent of the others, i.e. they belong to different streams altogether.
>>>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks
>>>>>>>> of data in a single burst by passing them as an ops array, but later found that not so useful
>>>>>>>> for PMD handling, for various reasons. Please refer to the RFC v1 doc review comments for details.
>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
>>>>>>> time, since each needs the state of the previous, allowing more than one op to
>>>>>>> be in flight at a time would force PMDs to implement internal queueing and
>>>>>>> exception handling for the OUT_OF_SPACE conditions you mention.
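To make the one-op-in-flight stateful flow concrete, here is a C sketch built on a mock engine that simply copies input to output (no real compression is performed). All names are hypothetical; a real application would go through the enqueue_burst()/dequeue_burst() calls discussed in this thread instead of a direct function call, and would also attach a stream and set the flush flag per op.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types for illustration only. */
enum sk_status { SK_SUCCESS, SK_OOS_RECOVERABLE };

struct sk_op {
	const unsigned char *src;
	size_t src_len;
	unsigned char *dst;
	size_t dst_len;
	size_t consumed;
	size_t produced;
	enum sk_status status;
};

/* Mock "PMD": copies input to output unchanged, and reports RECOVERABLE
 * when dst is too small, as a stateful op would. */
static void sk_mock_process(struct sk_op *op)
{
	size_t n = op->src_len < op->dst_len ? op->src_len : op->dst_len;

	memcpy(op->dst, op->src, n);
	op->consumed = n;
	op->produced = n;
	op->status = n < op->src_len ? SK_OOS_RECOVERABLE : SK_SUCCESS;
}

/* Feeds one logical stream through the mock in chunks of at most chunk_len,
 * with a dst window of dst_window bytes per op. Only one op is ever in
 * flight: each op's result is read back before the next op is built.
 * out must have room for in_len bytes. Returns total bytes produced. */
size_t sk_run_stream(const unsigned char *in, size_t in_len, size_t chunk_len,
		     unsigned char *out, size_t dst_window)
{
	size_t off = 0, out_len = 0;

	assert(chunk_len > 0 && dst_window > 0);
	while (off < in_len) {
		struct sk_op op = {0};

		op.src = in + off;
		op.src_len = in_len - off < chunk_len ? in_len - off : chunk_len;
		op.dst = out + out_len;
		op.dst_len = dst_window;
		sk_mock_process(&op);   /* stands in for enqueue + dequeue of one op */
		off += op.consumed;     /* resume after the last consumed byte */
		out_len += op.produced; /* keep output even on RECOVERABLE */
	}
	return out_len;
}
```

Because the next op is only built from the previous op's consumed/produced values, an OUT_OF_SPACE never has to ripple through ops already queued, which is the cost Fiona describes for multiple in-flight ops.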
>>>>> [Ahmed] But we are putting the ops on qps which would make them
>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>>>> complex but doable.
>>>> [Fiona] In my opinion this is not doable, or could be very inefficient.
>>>> There may be many streams.
>>>> The PMD would have to have an internal queue per stream so
>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>>>> And this may ripple back through all subsequent ops in the stream as each
>>>> source len is increased and its dst buffer is not big enough.
>>> [Ahmed] Regarding multi-op OUT_OF_SPACE handling:
>>> The caller would still need to adjust the src length/output buffer as you say.
>>> The PMD cannot handle OUT_OF_SPACE internally.
>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
>>> until it gets explicit confirmation from the caller to continue working on
>>> this stream. Any ops received by the PMD should be returned to the caller
>>> with status STREAM_PAUSED, since the caller did not explicitly acknowledge
>>> that it has solved the OUT_OF_SPACE issue.
>>> These semantics can be enabled by adding a new function to the API,
>>> perhaps stream_resume().
>>> This allows the caller to indicate that it acknowledges that it has seen
>>> the issue and that this op should be used to resolve the issue.
>>> Implementations that do not support this mode of use can push back
>>> immediately after one op is in flight. Implementations that support this
>>> use mode can allow many ops from the same session.
>>>
>> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If yes, I would
>> still say it would add an overhead to PMDs, especially if they are expected to work close to HW (which I think
>> is the case with DPDK PMDs).
>> Though your approach is doable, why can't all this be in a layer above the PMD? i.e. a layer above the PMD
>> can either pass one op at a time with burst size = 1 OR can make chained mbufs of input and output and
>> pass them as one op.
>> Is it just to ease the chained-mbuf burden on applications, or do you also see a performance/use-case
>> impacting aspect?
>>
>> If it is in the context where each op in a burst belongs to a different stream, then why do we need
>> stream_pause and resume? It is an expectation on the app to pass a larger output buffer, with consumed + 1,
>> from the next call onwards, as it has already seen OUT_OF_SPACE.
[Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
implementation rejects all ops that belong to a stream that has entered
"RECOVERABLE" state for one reason or another. The caller must
acknowledge explicitly that it has received news of the problem before
the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
that implementing this functionality in the software layer above the PMD
is a bad idea since the latency reductions are lost.
This setup is useful in latency sensitive applications where the latency
of buffering multiple ops into one op is significant. We found latency
makes a significant difference in search applications where the PMD
competes with software decompression.
> [Fiona] I still have concerns with this and would not want to support it in our PMD.
> To make sure I understand: you want to send a burst of ops, with several from the same stream.
> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
> subsequent ops in that stream.
> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
> Or somehow drop them? How?
> While still processing ops from other streams.
[Ahmed] This is exactly correct. It should return them with
NOT_PROCESSED. Yes, the PMD should continue processing other streams.
> As we want to offload each op to hardware with as little CPU processing as possible we
> would not want to open up each op to see which stream it's attached to and
> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.
[Ahmed] I think I might have missed your point here, but I will try to
answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
to the PMD and the PMD should reject until stream_continue() is called.
The next op to be sent by the user will have a special marker in it to
inform the PMD to continue working on this stream. Alternatively the
DPDK layer can be made "smarter" to fail during the enqueue by checking
the stream and its state, but like you say this adds additional CPU
overhead during the enqueue.
I am curious: in a simple synchronous use case, how do we prevent users
from putting multiple ops in flight that belong to a single stream? Do
we just currently say it is undefined behavior? Otherwise we would have
to check the stream and incur the CPU overhead.
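A sketch of the per-stream pause behaviour described above, using the variant where the resumption signal is a marker carried in the next op rather than a separate stream_resume()/stream_continue() call. The types, the `resume_marker` field and the `dst_too_small` test hook are hypothetical; this only models the status bookkeeping, not any real PMD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical statuses and types for illustration only. */
enum sk_status { SK_SUCCESS, SK_NOT_PROCESSED, SK_OOS_RECOVERABLE };

struct sk_stream {
	bool paused; /* set once an op on this stream runs out of space */
};

struct sk_op {
	struct sk_stream *stream;
	bool resume_marker;  /* hypothetical "out-of-space handled" marker */
	bool dst_too_small;  /* test hook: pretend this op overflows dst */
	enum sk_status status;
};

/* Hypothetical PMD-side handling of one op from a burst that may carry
 * several ops of the same stream. */
void sk_pmd_process(struct sk_op *op)
{
	struct sk_stream *s = op->stream;

	assert(s != NULL);
	if (s->paused && !op->resume_marker) {
		/* Caller has not acknowledged the problem yet: hand back untouched. */
		op->status = SK_NOT_PROCESSED;
		return;
	}
	s->paused = false;
	if (op->dst_too_small) {
		s->paused = true; /* reject later ops of this stream until resumed */
		op->status = SK_OOS_RECOVERABLE;
		return;
	}
	op->status = SK_SUCCESS;
}
```

Note that the check reads per-stream state for every op, which is exactly the per-op inspection Fiona wants to avoid in a hardware-offload PMD.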
>
> Maybe we could add a capability if this behaviour is important for you?
> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
> Our PMD would set this to 0, and expect no more than one op from a stateful stream
> to be in flight at any time.
[Ahmed] That makes sense. This way the different DPDK implementations do
not have to add extra checking for unsupported cases.
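If the capability Fiona suggests were adopted, one possible approach is for the application (rather than the PMD) to enforce the single-op-in-flight rule with a per-stream counter, so no per-op stream inspection is needed inside the PMD. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS is only a proposal in this thread; the flag value, `sk_stream` and the helper functions below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit named after the flag proposed in this thread;
 * it is not an existing DPDK feature flag. */
#define SK_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS (UINT64_C(1) << 0)

struct sk_stream {
	unsigned int in_flight; /* ops enqueued but not yet dequeued */
};

/* Application-side check before enqueueing a stateful op on a stream. */
bool sk_may_enqueue_stateful(uint64_t dev_flags, const struct sk_stream *s)
{
	if (dev_flags & SK_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS)
		return true;          /* PMD handles serialization and pausing */
	return s->in_flight == 0; /* otherwise at most one op per stream in flight */
}

void sk_on_enqueue(struct sk_stream *s)
{
	s->in_flight++;
}

void sk_on_dequeue(struct sk_stream *s)
{
	assert(s->in_flight > 0);
	s->in_flight--;
}
```

With the flag clear, misuse is caught in the application before enqueue; with it set, the application may keep several ops of a stream in flight and rely on the PMD-side behaviour Ahmed describes.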
>
>  
>>> Regarding the ordering of ops
>>> We do force serialization of ops belonging to a stream in STATEFUL
>>> operation. Related ops do
>>> not go out of order and are given to available PMDs one at a time.
>>>
>>>>> The question is: is this mode of use useful for real-life
>>>>> applications, or would we be just adding complexity? The technical
>>>>> advantage of this is that processing of stateful ops is interdependent
>>>>> and PMDs can take advantage of caching and other optimizations to make
>>>>> processing related ops much faster than switching on every op. PMDs have to
>>>>> maintain state of more than 32 KB for DEFLATE for every stream.
>>>>>>> If the application has all the data, it can put it into chained mbufs in a single
>>>>>>> op rather than
>>>>>>> multiple ops, which avoids pushing all that complexity down to the PMDs.
>>>>> [Ahmed] I think that your suggested scheme of putting all related mbufs
>>>>> into one op may be the best solution without the extra complexity of
>>>>> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
>>>>> time, if we have a way of marking mbufs as ready for consumption. The
>>>>> enqueuer may not have all the data at hand but can enqueue the op with a
>>>>> couple of empty mbufs marked as not ready for consumption. The enqueuer
>>>>> will then update the rest of the mbufs to ready for consumption once the
>>>>> data is added. This introduces a race condition. A second flag for each
>>>>> mbuf can be updated by the PMD to indicate that it processed it or not.
>>>>> This way in cases where the PMD beat the application to the op, the
>>>>> application will just update the op to point to the first unprocessed
>>>>> mbuf and resend it to the PMD.
>>>> [Fiona] This doesn't sound safe. You want to add data to a stream after you've
>>>> enqueued the op. You would have to write to op.src.length at a time when the PMD
>>>> might be reading it. Sounds like a lock would be necessary.
>>>> Once the op has been enqueued, my understanding is its ownership is handed
>>>> over to the PMD and the application should not touch it until it has been dequeued.
>>>> I don't think it's a good idea to change this model.
>>>> Can't the application just collect a stream of data in chained mbufs until it has
>>>> enough to send an op, then construct the op and while waiting for that op to
>>>> complete, accumulate the next batch of chained mbufs? Only construct the next op
>>>> after the previous one is complete, based on the result of the previous one.
>>>>
>>> [Ahmed] Fair enough. I agree with you. I imagined it in a different way
>>> in which each mbuf would have its own length.
>>> The advantage to gain is in applications where there is one PMD user,
>>> the down time between ops can be significant and setting up a single
>>> producer consumer pair significantly reduces the CPU cycles and PMD down
>>> time.
>>>
>>> ////snip////




* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-15 21:09                   ` Ahmed Mansour
@ 2018-02-16  7:16                     ` Verma, Shally
  2018-02-16 13:04                       ` Trahe, Fiona
  0 siblings, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-02-16  7:16 UTC (permalink / raw)
  To: Ahmed Mansour, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

Hi Fiona, Ahmed

>-----Original Message-----
>From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>Sent: 16 February 2018 02:40
>To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>Subject: Re: [RFC v2] doc compression API for DPDK
>
>On 2/15/2018 1:47 PM, Trahe, Fiona wrote:
>> Hi Shally, Ahmed,
>> Sorry for the delay in replying,
>> Comments below
>>
>>> -----Original Message-----
>>> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
>>> Sent: Wednesday, February 14, 2018 7:41 AM
>>> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
>>> dev@dpdk.org
>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
>>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>> Subject: RE: [RFC v2] doc compression API for DPDK
>>>
>>> Hi Ahmed,
>>>
>>>> -----Original Message-----
>>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>>>> Sent: 02 February 2018 01:53
>>>> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
>>>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
>>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>>>> Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>>
>>>> On 1/31/2018 2:03 PM, Trahe, Fiona wrote:
>>>>> ///snip///
>>> [Shally] IMO, it looks okay to have two separate return codes, TERMINATED and RECOVERABLE, with
>>> the definitions you mentioned, and it seems doable.
>>> So then it means all of the following conditions:
>>> a. stateless with flush = full/final, no stream pointer provided: PMD can return TERMINATED, i.e. the user
>>> has to start all over again; it's a failure (as in the current definition)
>>> b. stateless with flush = full/final, stream pointer provided: here it's up to the PMD to return either
>>> TERMINATED or RECOVERABLE depending upon its ability (note: if RECOVERABLE, then the PMD will maintain
>>> state in the stream pointer)
>>> c. stateful with flush = full / NO_SYNC, stream pointer always there: PMD will return
>>> TERMINATED/RECOVERABLE depending on whether the STATEFUL_COMPRESSION/DECOMPRESSION feature flag
>>> is enabled or not
>> [Fiona] I don't think the flush flag is relevant - it could be out of space on any flush flag, and if out of space
>> should ignore the flush flag.
>> Is there a need for TERMINATED? - I didn't think it would ever need to be returned in stateful case.
>>  Why the reference to the feature flag? If a PMD doesn't support a feature, I think it should fail the op, not with
>>  out-of-space, but with unsupported or similar. Or it would fail on stream creation.
>[Ahmed] Agreed with Fiona. The flush flag only matters on success. By
>definition the PMD should return OUT_OF_SPACE_RECOVERABLE in stateful
>mode when it runs out of space.
>@Shally If the user did not provide a stream, then the PMD should
>probably return TERMINATED every time. I am not sure we should make a
>"really smart" PMD which returns RECOVERABLE even if no stream pointer
>was given. In that case the PMD must give some ID back to the caller
>that the caller can use to "recover" the op. I am not sure how it would
>be implemented in the PMD and when does the PMD decide to retire streams
>belonging to dead ops that the caller decided not to "recover".
>>
>>> and one more exception case is:
>>> d. stateless with flush = full, no stream pointer provided: PMD can return RECOVERABLE, i.e. the PMD
>>> internally maintained that state somehow and consumed & produced > 0, so the user can start from consumed+1,
>>> but there's a restriction on the user not to alter or change the op until it is fully processed?!
>> [Fiona] Why the need for this case?
>> There's always a restriction on user not to alter or change op until it is fully processed.
>> If a PMD can do this - why doesn't it create a stream when that API is called - and then it's same as b?
>[Ahmed] Agreed. The user should not touch an op once enqueued until they
>receive it in dequeue. We ignore the flush in stateless mode. We assume
>it to be final every time.

[Shally] Agreed, and I am not in favour of supporting such an implementation either. I just listed the different possibilities here to better visualise Ahmed's requirements and the applicability of TERMINATED and RECOVERABLE.

>>
>>> The API currently takes care of cases a and c, and case b can be supported if the specification accepts another
>>> proposal which mentions optional usage of a stream with stateless ops.
>> [Fiona] API has this, but as we agreed, not optional to call the create_stream() with an op_type
>> parameter (stateful/stateless). PMD can return NULL or provide a stream, if the latter then that
>> stream must be attached to ops.
>>
>>> Until then, the API makes no distinction between
>>> cases b and c, i.e. we can have an op such as:
>>> - type = stateful with flush = full/final, stream pointer provided: PMD can return
>>> TERMINATED/RECOVERABLE according to its ability
>>>
>>> Case d is something exceptional; if there's a requirement for PMDs to support it, then I believe it will be
>>> doable with the concept of different return codes.
>>>
>> [Fiona] That's not quite how I understood it. Can it be simpler and only following cases?
>> a. stateless with flush = full/final, no stream pointer provided , PMD can return TERMINATED i.e. user
>>     has to start all over again, it's a failure (as in current definition).
>>     consumed = 0, produced=amount of data produced. This is usually 0, but in decompression
>>     case a PMD may return > 0 and application may find it useful to inspect that data.
>> b. stateless with flush = full/final, stream pointer provided, here it's up to PMD to return either
>>     TERMINATED or RECOVERABLE depending upon its ability (note if Recoverable, then PMD will maintain
>>     states in stream pointer)
>> c. stateful with flush = any, stream pointer always there, PMD will return RECOVERABLE.
>>     op.produced can be used and next op in stream should continue on from op.consumed+1.
>>     Consumed=0, produced=0 is an unusual but allowed case. I'm not sure if it could ever happen, but
>>     no need to change state to TERMINATED in this case. There may be useful state/history
>>     stored in the PMD, even though no output produced yet.
>[Ahmed] Agreed
[Shally] Sounds good.
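
The three agreed cases (a, b, c) boil down to one decision the application makes for every dequeued op. A minimal C sketch of that decision follows, under stated assumptions: the status names mirror the TERMINATED/RECOVERABLE proposal in this thread, but the enums, the reduced `comp_op` struct and `next_step_for()` are hypothetical illustrations, not DPDK API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes mirroring the names proposed in this thread. */
enum op_status {
    OP_STATUS_SUCCESS,
    OP_STATUS_OUT_OF_SPACE_TERMINATED,  /* cases a/b: no recovery */
    OP_STATUS_OUT_OF_SPACE_RECOVERABLE, /* cases b/c: state kept in the stream */
    OP_STATUS_ERROR
};

enum op_type { OP_STATELESS, OP_STATEFUL };

/* Hypothetical stand-in for an op, reduced to the fields discussed here. */
struct comp_op {
    enum op_type type;
    enum op_status status;
    uint32_t src_offset; /* where this op's input starts in the caller's data */
    uint32_t consumed;   /* input bytes read by the PMD */
    uint32_t produced;   /* output bytes written by the PMD */
};

enum next_step {
    STEP_DONE,         /* op complete, output usable */
    STEP_RESUBMIT_ALL, /* redo the whole input with a bigger dst buffer */
    STEP_CONTINUE,     /* keep op.produced, resume input after op.consumed */
    STEP_FAIL
};

/* Decide what the application does with a dequeued op. *next_src is the
 * 0-based offset of the next input byte to submit; the thread's
 * "consumed+1" is the same position counted from 1. */
static enum next_step next_step_for(const struct comp_op *op, uint32_t *next_src)
{
    assert(op != NULL && next_src != NULL);
    switch (op->status) {
    case OP_STATUS_SUCCESS:
        *next_src = op->src_offset + op->consumed;
        return STEP_DONE;
    case OP_STATUS_OUT_OF_SPACE_TERMINATED:
        /* consumed is meaningless; produced > 0 (decompression only) may be
         * inspected, but nothing can be resumed. */
        *next_src = op->src_offset;
        return STEP_RESUBMIT_ALL;
    case OP_STATUS_OUT_OF_SPACE_RECOVERABLE:
        /* consumed == 0 && produced == 0 is unusual but allowed: the PMD may
         * still hold useful state/history for this stream. */
        *next_src = op->src_offset + op->consumed;
        return STEP_CONTINUE;
    default:
        return STEP_FAIL;
    }
}
```

The point of splitting the status is visible in the switch: only the RECOVERABLE branch ever reads `consumed` to compute a resume position.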

>>
>>>>>>>>>>> D.2 Compression API Stateful operation
>>>>>>>>>>> ----------------------------------------------------------
>>>>>>>>>>>  A Stateful operation in DPDK compression means the application invokes
>>>>>>>>>>> enqueue_burst() multiple times to process a related chunk of data, either
>>>>>>>>>>> because
>>>>>>>>>>> - the application broke the data into several ops, and/or
>>>>>>>>>>> - the PMD ran into an out_of_space situation during input processing
>>>>>>>>>>>
>>>>>>>>>>> In case of either or both of the above conditions, the PMD is required to
>>>>>>>>>>> maintain the state of the op across enqueue_burst() calls, and
>>>>>>>>>>> ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with
>>>>>>>>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and ending with flush value
>>>>>>>>>>> RTE_COMP_FULL/FINAL_FLUSH.
>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>> Ideally, the application should parse through all related chunks of
>>>>>>>>>>> source data, making its mbuf-chain, and enqueue it for stateless processing.
>>>>>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then
>>>>>>>>>>> an expected call flow would be something like:
>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
>>>>>>>>>> burst in a loop until all ops are received. Is this correct?
>>>>>>>>>>
>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is specifically in
>>>>>>>>> the context of stateful op processing, to reflect that if a stream is broken into chunks, then each chunk should be
>>>>>>>>> submitted as one op at a time with type = STATEFUL and needs to be dequeued before the next chunk is
>>>>>>>>> enqueued.
>>>>>>>>>
>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
>>>>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish
>>>>>>>>>> the response in exception cases?
>>>>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in such a case is independent of
>>>>>>>>> the others, i.e. they belong to different streams altogether.
>>>>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks of data in a single
>>>>>>>>> burst by passing them as an ops array, but later found that not so useful for PMD handling, for various
>>>>>>>>> reasons. Please refer to the RFC v1 doc review comments for the same.
>>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a time, since each needs the
>>>>>>>> state of the previous, allowing more than one op to be in flight at a time would
>>>>>>>> force PMDs to implement internal queueing and exception handling for the
>>>>>>>> OUT_OF_SPACE conditions you mention.
>>>>>> [Ahmed] But we are putting the ops on qps which would make them
>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>>>>> complex but doable.
>>>>> [Fiona] In my opinion this is not doable, could be very inefficient.
>>>>> There may be many streams.
>>>>> The PMD would have to have an internal queue per stream so
>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>>>>> And this may ripple back through all subsequent ops in the stream as each
>>>>> source len is increased and its dst buffer is not big enough.
>>>> [Ahmed] Regarding multi-op OUT_OF_SPACE handling:
>>>> the caller would still need to adjust the src length/output buffer as you say.
>>>> The PMD cannot handle OUT_OF_SPACE internally.
>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
>>>> until it gets explicit confirmation from the caller to continue working on this stream.
>>>> Any ops received by the PMD should be returned to the caller with status STREAM_PAUSED, since
>>>> the caller did not explicitly acknowledge that it has solved the OUT_OF_SPACE issue.
>>>> These semantics can be enabled by adding a new function to the API,
>>>> perhaps stream_resume().
>>>> This allows the caller to indicate that it acknowledges that it has seen
>>>> the issue and that this op should be used to resolve the issue. Implementations that do not support
>>>> this mode of use can push back immediately after one op is in flight. Implementations
>>>> that support this mode of use can allow many ops from the same session.
>>>>
>>> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If yes, I would still
>>> say it would add overhead to PMDs, especially if they are expected to work close to HW (which I think is
>>> the case with DPDK PMDs).
>>> Your approach is doable, but why can't all this be in a layer above the PMD? i.e. a layer above the PMD
>>> can either pass one op at a time with burst size = 1 OR can make a chained mbuf of input and output and
>>> pass that as one op.
>>> Is it just to ease the chained-mbuf burden on applications, or do you see any performance/use-case
>>> impacting aspect also?
>>>
>>> If it is in the context where each op in a burst belongs to a different stream, then why do we need
>>> stream_pause and resume? The app is expected to pass a larger output buffer, starting from consumed + 1,
>>> from the next call onwards, as it has already seen OUT_OF_SPACE.
>[Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
>implementation rejects all ops that belong to a stream that has entered
>"RECOVERABLE" state for one reason or another. The caller must
>acknowledge explicitly that it has received news of the problem before
>the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
>that implementing this functionality in the software layer above the PMD
>is a bad idea since the latency reductions are lost.

[Shally] Just reiterating: I meant it the other way around, i.e. I find it easier to put all such complexity in a layer above the PMD.
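
A layer above the PMD that enforces the D.2 flow (one stateful op per stream in flight, dequeue before the next enqueue, resume from consumed on OUT_OF_SPACE) can be sketched in C as below. This is a minimal sketch under loud assumptions: `mock_pmd_process()` is a hypothetical identity "codec" standing in for a real PMD and its enqueue/dequeue pair, and `comp_op`, `feed_stream()` and the flush/status names are illustrative only, not DPDK API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum op_status { OP_SUCCESS, OP_OUT_OF_SPACE_RECOVERABLE };
enum flush_flag { FLUSH_NONE, FLUSH_FINAL };

/* Hypothetical op: only the fields this sketch needs. */
struct comp_op {
    const uint8_t *src;
    size_t src_len;
    uint8_t *dst;
    size_t dst_len;
    enum flush_flag flush; /* FLUSH_FINAL on the last chunk of the stream */
    size_t consumed;
    size_t produced;
    enum op_status status;
};

/* Mock PMD (hypothetical): an identity "codec" that copies as much input as
 * fits in dst and reports OUT_OF_SPACE_RECOVERABLE when dst fills up. A real
 * PMD would keep history in the stream; this mock needs none. */
static void mock_pmd_process(struct comp_op *op)
{
    size_t n = op->src_len < op->dst_len ? op->src_len : op->dst_len;
    memcpy(op->dst, op->src, n);
    op->consumed = n;
    op->produced = n;
    op->status = n < op->src_len ? OP_OUT_OF_SPACE_RECOVERABLE : OP_SUCCESS;
}

/* Layer above the PMD: splits one stream into chunk-sized stateful ops and
 * keeps exactly one op in flight. On OUT_OF_SPACE_RECOVERABLE it keeps
 * op.produced and resubmits the rest of the chunk from op.consumed with a
 * fresh dst window of at most dst_chunk bytes. Returns the number of bytes
 * written to out, or 0 if out fills up before the stream is finished. */
static size_t feed_stream(const uint8_t *in, size_t in_len, size_t chunk,
                          uint8_t *out, size_t out_cap, size_t dst_chunk)
{
    size_t in_pos = 0, out_pos = 0;
    assert(chunk > 0 && dst_chunk > 0);
    while (in_pos < in_len) {
        size_t this_len = in_len - in_pos < chunk ? in_len - in_pos : chunk;
        size_t done = 0;
        enum op_status st;
        do {
            size_t room = out_cap - out_pos;
            if (room == 0)
                return 0;
            struct comp_op op = {
                .src = in + in_pos + done,
                .src_len = this_len - done,
                .dst = out + out_pos,
                .dst_len = room < dst_chunk ? room : dst_chunk,
                .flush = (in_pos + this_len == in_len) ? FLUSH_FINAL : FLUSH_NONE,
            };
            mock_pmd_process(&op); /* stands in for enqueue + dequeue of one op */
            out_pos += op.produced;
            done += op.consumed;
            st = op.status;
        } while (st == OP_OUT_OF_SPACE_RECOVERABLE);
        in_pos += this_len;
    }
    return out_pos;
}
```

Because the serialization lives here, the PMD never sees two ops of the same stream at once; the trade-off, as Ahmed notes below, is the extra latency of waiting for each op before submitting the next.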

>This setup is useful in latency sensitive applications where the latency
>of buffering multiple ops into one op is significant. We found latency
>makes a significant difference in search applications where the PMD
>competes with software decompression.
>> [Fiona] I still have concerns with this and would not want to support it in our PMD.
>> To make sure I understand: you want to send a burst of ops, with several from the same stream.
>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
>> subsequent ops in that stream.
>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
>> Or somehow drop them? How?
>> While still processing ops from other streams.
>[Ahmed] This is exactly correct. It should return them with
>NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>> As we want to offload each op to hardware with as little CPU processing as possible we
>> would not want to open up each op to see which stream it's attached to and
>> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.
>[Ahmed] I think I might have missed your point here, but I will try to
>answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
>to the PMD and the PMD should reject until stream_continue() is called.
>The next op to be sent by the user will have a special marker in it to
>inform the PMD to continue working on this stream. Alternatively the
>DPDK layer can be made "smarter" to fail during the enqueue by checking
>the stream and its state, but like you say this adds additional CPU
>overhead during the enqueue.
>I am curious. In a simple synchronous use case. How do we prevent users
>from putting multiple ops in flight that belong to a single stream? Do
>we just currently say it is undefined behavior? Otherwise we would have
>to check the stream and incur the CPU overhead.
>>
>> Maybe we could add a capability if this behaviour is important for you?
>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
>> to be in flight at any time.
>[Ahmed] That makes sense. This way the different DPDK implementations do
>not have to add extra checking for unsupported cases.

[Shally] @Ahmed, if I summarise your use case, is this how you want the PMD to support it?
- a burst *carries only one stream* and all ops are then assumed to belong to that stream? (please note, here a burst does not carry more than one stream)
- the PMD will submit one op at a time to HW?
- if an op is processed successfully, it is pushed back to the completion queue with status = SUCCESS. If it fails or runs into OUT_OF_SPACE, then it is pushed to the completion queue with status = FAILURE/OUT_OF_SPACE_RECOVERABLE and the rest with status = NOT_PROCESSED, and enqueue returns a count equal to the total # of ops originally submitted in the burst?
- the app assumes all have been enqueued, so it goes and dequeues all ops
- on seeing an op with OUT_OF_SPACE_RECOVERABLE, the app resubmits a burst of ops, with a call to the stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE, followed by the others marked NOT_PROCESSED, with updated input and output buffers?
- repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If a failure is seen at any time, does the app start the whole processing all over again or just drop this burst?!

If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream() which will be functional under Feature Flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS, or a better name is SUPPORT_ENQUEUE_SINGLE_STREAM?!
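
The single-stream burst semantics summarised above can be sketched as a PMD-side marking rule plus an application-side resume search. This is a simplified, hypothetical C model: `stream_op`, `process_single_stream_burst()` and `first_unfinished()` are illustrative names, partial progress inside the failing op is ignored, and the proposed stream_continue/resume call is not modelled.

```c
#include <assert.h>
#include <stddef.h>

enum op_status { ST_NOT_PROCESSED, ST_SUCCESS, ST_OUT_OF_SPACE_RECOVERABLE, ST_FAILURE };

/* Hypothetical op reduced to what this sketch needs: how much output it will
 * produce and how much room the application gave it. */
struct stream_op {
    size_t out_needed;
    size_t dst_len;
    enum op_status status;
};

/* PMD side: every op in the burst belongs to ONE stream and is handed to HW
 * in order. The first op that does not succeed stops processing; later ops
 * come back NOT_PROCESSED. The enqueue count is still the full burst size. */
static size_t process_single_stream_burst(struct stream_op *ops, size_t n)
{
    int stopped = 0;
    for (size_t i = 0; i < n; i++) {
        if (stopped) {
            ops[i].status = ST_NOT_PROCESSED;
        } else if (ops[i].dst_len >= ops[i].out_needed) {
            ops[i].status = ST_SUCCESS;
        } else {
            ops[i].status = ST_OUT_OF_SPACE_RECOVERABLE;
            stopped = 1;
        }
    }
    return n;
}

/* Application side: index of the first op to resubmit (the one that hit
 * OUT_OF_SPACE), or n if every op succeeded. */
static size_t first_unfinished(const struct stream_op *ops, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ops[i].status != ST_SUCCESS)
            return i;
    return n;
}
```

Usage: after dequeuing, the application calls `first_unfinished()`, enlarges the failing op's output buffer, and resubmits the tail of the array starting at that index.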


>>
>>
>>>> Regarding the ordering of ops
>>>> We do force serialization of ops belonging to a stream in STATEFUL
>>>> operation. Related ops do
>>>> not go out of order and are given to available PMDs one at a time.
>>>>
>>>>>> The question is: is this mode of use useful for real
>>>>>> life applications, or would we be just adding complexity? The technical
>>>>>> advantage of this is that processing of stateful ops is interdependent
>>>>>> and PMDs can take advantage of caching and other optimizations to make
>>>>>> processing related ops much faster than switching on every op. PMDs have to
>>>>>> maintain more than 32 KB of state for DEFLATE for every stream.
>>>>>>>> If the application has all the data, it can put it into chained mbufs in a single
>>>>>>>> op rather than
>>>>>>>> multiple ops, which avoids pushing all that complexity down to the PMDs.
>>>>>> [Ahmed] I think that your suggested scheme of putting all related mbufs
>>>>>> into one op may be the best solution without the extra complexity of
>>>>>> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
>>>>>> time, if we have a way of marking mbufs as ready for consumption. The
>>>>>> enqueuer may not have all the data at hand but can enqueue the op with a
>>>>>> couple of empty mbufs marked as not ready for consumption. The enqueuer
>>>>>> will then update the rest of the mbufs to ready for consumption once the
>>>>>> data is added. This introduces a race condition. A second flag for each
>>>>>> mbuf can be updated by the PMD to indicate whether it processed it or not.
>>>>>> This way in cases where the PMD beat the application to the op, the
>>>>>> application will just update the op to point to the first unprocessed
>>>>>> mbuf and resend it to the PMD.
>>>>> [Fiona] This doesn't sound safe. You want to add data to a stream after you've
>>>>> enqueued the op. You would have to write to op.src.length at a time when the PMD
>>>>> might be reading it. Sounds like a lock would be necessary.
>>>>> Once the op has been enqueued, my understanding is its ownership is handed
>>>>> over to the PMD and the application should not touch it until it has been dequeued.
>>>>> I don't think it's a good idea to change this model.
>>>>> Can't the application just collect a stream of data in chained mbufs until it has
>>>>> enough to send an op, then construct the op and while waiting for that op to
>>>>> complete, accumulate the next batch of chained mbufs? Only construct the next op
>>>>> after the previous one is complete, based on the result of the previous one.
>>>>>
>>>> [Ahmed] Fair enough. I agree with you. I imagined it in a different way
>>>> in which each mbuf would have its own length.
>>>> The advantage to gain is in applications where there is one PMD user,
>>>> the down time between ops can be significant and setting up a single
>>>> producer consumer pair significantly reduces the CPU cycles and PMD down
>>>> time.
>>>>
>>>> ////snip////
>


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-15 19:51                           ` Ahmed Mansour
@ 2018-02-16 11:11                             ` Trahe, Fiona
  0 siblings, 0 replies; 30+ messages in thread
From: Trahe, Fiona @ 2018-02-16 11:11 UTC (permalink / raw)
  To: Ahmed Mansour, Verma, Shally, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry



> -----Original Message-----
> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> Sent: Thursday, February 15, 2018 7:51 PM
> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> Subject: Re: [RFC v2] doc compression API for DPDK
> 
> /// snip ///
> >>>>>
> >>>>>>>> [Fiona] I propose if BFINAL bit is detected before end of input
> >>>>>>>> the decompression should stop. In this case consumed will be < src.length.
> >>>>>>>> produced will be < dst buffer size. Do we need an extra STATUS response?
> >>>>>>>> STATUS_BFINAL_DETECTED  ?
> >>>>>>> [Shally] @fiona, I assume you mean here decompressor stop after processing Final block right?
> >>>>>> [Fiona] Yes.
> >>>>>>
> >>>>>>  And if yes,
> >>>>>>> and if it can process that final block successfully/unsuccessfully, then status could simply be
> >>>>>>> SUCCESS/FAILED.
> >>>>>>> I don't see the need for a specific return code for this use case. Just to share, in the past we have
> >>>>>>> practically run into such cases with the boost lib, and the decompressor has simply worked this way.
> >>>>>> [Fiona] I'm ok with this.
> >>>>>>
> >>>>>>>> Only thing I don't like this is it can impact on performance, as normally
> >>>>>>>> we can just look for STATUS == SUCCESS. Anything else should be an exception.
> >>>>>>>> Now the application would have to check for SUCCESS || BFINAL_DETECTED every time.
> >>>>>>>> Do you have a suggestion on how we should handle this?
> >>>>>>>>
> >>>>> [Ahmed] This makes sense. So in all cases the PMD should assume that it
> >>>>> should stop as soon as a BFINAL is observed.
> >>>>>
> >>>>> A question: what happens in stateful vs stateless modes when
> >>>>> decompressing an op that encompasses multiple BFINALs? I assume the
> >>>>> caller in that case will use the consumed=x bytes to find out how far
> >>>>> into the input the end of the first stream is, and start from the next
> >>>>> byte. Is this correct?
> >>>> [Shally] As per my understanding, each op can be tied to only one stream, as we have only one
> >>>> stream pointer per op, and one stream can have only one BFINAL (as a stream is one complete unit of
> >>>> compressed data), but it looks like you're suggesting a case where one op can carry multiple independent
> >>>> streams, and thus multiple BFINALs?! For example, below is an op pointing to more than one stream:
> >>>>             --------------------------------------------
> >>>> op --> |stream1|stream2| |stream3|
> >>>>            --------------------------------------------
> >>>>
> >>>> Could you confirm if I understand your question correct?
> >>> [Ahmed] Correct. We found that in some storage applications the user
> >>> does not know where exactly the BFINAL is. They rely on zlib software
> >>> today. zlib.net software halts at the first BFINAL. Users put multiple
> >>> streams in one op and rely on zlib to stop and inform them of the end
> >>> location of the first stream.
> >> [Shally] Then this is a case that can practically occur on the decompressor, and the decompressor doesn't regard the flush
> >> flag. So in that case, I expect the PMD to internally reset itself (say, in the case of zlib, going through a cycle
> >> of inflateEnd and inflateInit, or inflateReset) and return with status = SUCCESS with updated produced
> >> and consumed. Now in such a case, if the previous stream also has some footer followed by the start of the next
> >> stream, then I am not sure how the PMD/lib can support that case. Have you practically run such a
> >> use case on zlib? If yes, how did such an application handle it in your experience?
> >> I can imagine that for such input zlib would return Z_STREAM_END to the user after the 1st BFINAL is processed.
> >> Then the application does an inflateReset() or inflateEnd()/inflateInit() cycle before starting with the next one. But if it starts
> >> with input that doesn't have a valid zlib header, then it will likely throw an error.
> >>
> > [Fiona] The consumed and produced fields tell the application how much data was processed up to
> > the end of the first deflate block encountered with BFINAL set.
> > If there is data, e.g. footer after the block with bfinal, then I think it must be the responsibility of
> > the application to know this, the PMD can't have any responsibility for this.
> > The next op sent to the PMD must start with a valid deflate block.
> [Ahmed] Agreed. This is exactly what I expected. In our case we support
> gzip and zlib header/footer processing, but that does not fundamentally
> change the setup. The user may have other meta data after the footer
> which the PMD is not responsible for. The PMD should stop processing
> depending on the mode. In raw DEFLATE, it should stop immediately. In
> other modes it should stop after the footer. We also have a mode in our
> PMD to simply continue decompression. In that case there cannot be
> header/footer between streams in raw DEFLATE. That mode can be enabled
> perhaps at the session level in the future with a session parameter at
> setup time. We call it "member continue". In this mode the PMD plows
> through as much of the op as possible. If it hits incorrectly setup data
> then it returns what it did decompress successfully and the error code
> in decompressing the data afterwards.
[Fiona] Yes, these would be interesting capabilities which could be 
added to the API in future releases.
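
The agreed behaviour (stop right after the first block with BFINAL set, report consumed/produced, and leave any footer or following stream to the application) can be illustrated with a deliberately limited decoder. The sketch assumes the input is raw DEFLATE made only of stored (uncompressed, BTYPE = 00) blocks as defined in RFC 1951; a real PMD also handles Huffman-coded blocks. `inflate_stored()` and its status enum are hypothetical names, not part of DPDK or zlib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum inflate_status {
    INF_SUCCESS,         /* stopped right after a block with BFINAL set */
    INF_NEED_MORE_INPUT, /* input ended before a BFINAL block completed */
    INF_OUT_OF_SPACE,    /* next block does not fit in dst */
    INF_UNSUPPORTED,     /* BTYPE 01/10: compressed blocks, out of scope */
    INF_BAD_DATA         /* LEN/NLEN mismatch */
};

/* Decode raw DEFLATE containing only stored blocks and stop immediately after
 * the first block whose BFINAL bit is set. *consumed and *produced always
 * count whole blocks processed, so the caller can find where trailing data
 * or the next stream begins. */
static enum inflate_status inflate_stored(const uint8_t *src, size_t src_len,
                                          uint8_t *dst, size_t dst_len,
                                          size_t *consumed, size_t *produced)
{
    size_t in = 0, out = 0;
    enum inflate_status st;

    for (;;) {
        /* Stored blocks end byte aligned, so in a stored-only stream each
         * 3-bit header (BFINAL, then BTYPE, LSB first) plus its padding bits
         * fills exactly one byte, followed by LEN and NLEN (little endian). */
        if (src_len - in < 5) { st = INF_NEED_MORE_INPUT; break; }
        int bfinal = src[in] & 1;
        int btype = (src[in] >> 1) & 3;
        if (btype != 0) { st = INF_UNSUPPORTED; break; }
        uint16_t len = (uint16_t)(src[in + 1] | (src[in + 2] << 8));
        uint16_t nlen = (uint16_t)(src[in + 3] | (src[in + 4] << 8));
        if ((uint16_t)~nlen != len) { st = INF_BAD_DATA; break; }
        if (src_len - in - 5 < len) { st = INF_NEED_MORE_INPUT; break; }
        if (dst_len - out < len) { st = INF_OUT_OF_SPACE; break; }
        memcpy(dst + out, src + in + 5, len);
        in += 5 + (size_t)len;
        out += len;
        if (bfinal) { st = INF_SUCCESS; break; }
    }
    *consumed = in;
    *produced = out;
    return st;
}
```

Calling it again on `src + consumed` is how the application in Ahmed's storage example would walk concatenated streams; any header, footer or metadata between them stays the application's responsibility.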

> >
> >
> >>>> Thanks
> >>>> Shally
> >>>>
> >


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-16  7:16                     ` Verma, Shally
@ 2018-02-16 13:04                       ` Trahe, Fiona
  2018-02-16 21:21                         ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Trahe, Fiona @ 2018-02-16 13:04 UTC (permalink / raw)
  To: Verma, Shally, Ahmed Mansour, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry, Trahe, Fiona



> -----Original Message-----
> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
> Sent: Friday, February 16, 2018 7:17 AM
> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
> dev@dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> Subject: RE: [RFC v2] doc compression API for DPDK
> 
> Hi Fiona, Ahmed
> 
> >-----Original Message-----
> >From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> >Sent: 16 February 2018 02:40
> >To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
> >Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> <Ashish.Gupta@cavium.com>; Sahu, Sunila
> ><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa,
> Mahipal
> ><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal
> <hemant.agrawal@nxp.com>; Roy
> >Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> >Subject: Re: [RFC v2] doc compression API for DPDK
> >
> >On 2/15/2018 1:47 PM, Trahe, Fiona wrote:
> >> Hi Shally, Ahmed,
> >> Sorry for the delay in replying,
> >> Comments below
> >>
> >>> -----Original Message-----
> >>> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
> >>> Sent: Wednesday, February 14, 2018 7:41 AM
> >>> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
> >>> dev@dpdk.org
> >>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> >>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> >>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
> >>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> >>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> >>> Subject: RE: [RFC v2] doc compression API for DPDK
> >>>
> >>> Hi Ahmed,
> >>>
> >>>> -----Original Message-----
> >>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> >>>> Sent: 02 February 2018 01:53
> >>>> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>;
> dev@dpdk.org
> >>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> >>> <Ashish.Gupta@cavium.com>; Sahu, Sunila
> >>>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa,
> >>> Mahipal
> >>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal
> >>> <hemant.agrawal@nxp.com>; Roy
> >>>> Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> >>>> Subject: Re: [RFC v2] doc compression API for DPDK
> >>>>
> >>>> On 1/31/2018 2:03 PM, Trahe, Fiona wrote:
> >>>>> Hi Ahmed, Shally,
> >>>>>
> >>>>> ///snip///
> >>>>>>>>>>> D.1.1 Stateless and OUT_OF_SPACE
> >>>>>>>>>>> ------------------------------------------------
> >>>>>>>>>>> OUT_OF_SPACE is a condition where the output buffer runs out of space
> >>>>>>>>>>> while the PMD still has more data to produce. If the PMD runs into such a
> >>>>>>>>>>> condition, then it's an error condition in stateless processing.
> >>>>>>>>>>> In such a case, the PMD resets itself and returns with status
> >>>>>>>>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0,
> >>>>>>>>>>> i.e. no input read, no output written.
> >>>>>>>>>>> The application can resubmit the full input with a larger output buffer size.
> >>>>>>>>>> [Ahmed] Can we add an option to allow the user to read the data that was
> >>>>>>>>>> produced while still reporting OUT_OF_SPACE? This is mainly useful for
> >>>>>>>>>> decompression applications doing search.
> >>>>>>>>> [Shally] It is there, but applicable to the stateful operation type (please refer to handling out_of_space under
> >>>>>>>>> the "Stateful Section").
> >>>>>>>>> By definition, "stateless" here means that the application (such as IPCOMP) knows the maximum output size
> >>>>>>>>> with certainty and ensures that the uncompressed data size cannot grow beyond the provided output buffer.
> >>>>>>>>> Such apps can submit an op with type = STATELESS and provide full input; the PMD then assumes it has
> >>>>>>>>> sufficient input and output and thus doesn't need to maintain any context after the op is processed.
> >>>>>>>>> If the application doesn't know the max output size, then it should process it as a stateful op, i.e. set up the op
> >>>>>>>>> with type = STATEFUL and attach a stream so that the PMD can maintain relevant context to handle such a
> >>>>>>>>> condition.
> >>>>>>>> [Fiona] There may be an alternative that's useful for Ahmed, while still
> >>>>>>>> respecting the stateless concept.
> >>>>>>>> In the stateless case, where a PMD reports OUT_OF_SPACE in the decompression case,
> >>>>>>>> it could also return consumed=0, produced = x, where x>0. x indicates the
> >>>>>>>> amount of valid data which has been written to the output buffer. It is not complete,
> >>>>>>>> but if an application wants to search it, it may be sufficient.
> >>>>>>>> If the application still wants the data, it must resubmit the whole input with a
> >>>>>>>> bigger output buffer, and decompression will be repeated from the start; it
> >>>>>>>> cannot expect to continue on, as the PMD has not maintained state, history
> >>>>>>>> or data.
> >>>>>>>> I don't think there would be any need to indicate this in capabilities; PMDs
> >>>>>>>> which cannot provide this functionality would always return produced=consumed=0,
> >>>>>>>> while PMDs which can could set produced > 0.
> >>>>>>>> If this works for you both, we could consider a similar case for compression.
> >>>>>>>>
> >>>>>>> [Shally] Sounds fine to me. Though in that case, consumed should also be updated to the actual
> >>>>>>> amount consumed by the PMD.
> >>>>>>> Setting consumed = 0 with produced > 0 doesn't correlate.
> >>>>>> [Ahmed] I like Fiona's suggestion, but I also do not like the implication
> >>>>>> of returning consumed = 0. At the same time, returning consumed = y
> >>>>>> implies to the user that it can proceed from the middle. I prefer the
> >>>>>> consumed = 0 implementation, but I think a different return is needed to
> >>>>>> distinguish it from an OUT_OF_SPACE that the user can recover from. Perhaps
> >>>>>> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
> >>>>>> future PMD implementations to provide recover-ability even in STATELESS
> >>>>>> mode if they so wish. In this model STATELESS or STATEFUL would be a
> >>>>>> hint for the PMD implementation to make optimizations for each case, but
> >>>>>> it does not force the PMD implementation to limit functionality if it
> >>>>>> can provide recover-ability.
> >>>>> [Fiona] So you're suggesting the following:
> >>>>> OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
> >>>>>     can be used and next op in stream should continue on from op.consumed+1.
> >>>>> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
> >>>>>     Error condition, no recovery possible.
> >>>>>     consumed=produced=0. Application must resubmit all input data with
> >>>>>     a bigger output buffer.
> >>>>> OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
> >>>>>      - consumed = 0, produced > 0. Application must resubmit all input data with
> >>>>>         a bigger output buffer. However in decompression case, data up to produced
> >>>>>         in dst buffer may be inspected/searched. Never happens in compression
> >>>>>         case as output data would be meaningless.
> >>>>>      - consumed > 0, produced > 0. PMD has stored relevant state and history and so
> >>>>>         can convert to stateful, using op.produced and continuing from consumed+1.
> >>>>> I don't expect our PMDs to use this last case, but maybe this works for others?
> >>>>> I'm not convinced it's not just adding complexity. It sounds like a version of stateful
> >>>>> without a stream, and maybe less efficient?
> >>>>> If so should it respect the FLUSH flag? Which would have been FULL or FINAL in the op.
> >>>>> Or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
> >>>>> simply have submitted a STATEFUL request if this is the behaviour it wants?
> >>>> [Ahmed] I was actually suggesting removing OUT_OF_SPACE entirely
> >>>> and replacing it with:
> >>>> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
> >>>>        Error condition, no recovery possible.
> >>>>        - consumed=0, produced=amount of data produced. Application must
> >>>>          resubmit all input data with a bigger output buffer to process all of the op.
> >>>> OUT_OF_SPACE_RECOVERABLE - normally returned on stateful operation. Not an error.
> >>>>        op.produced can be used and the next op in the stream should continue on from op.consumed+1.
> >>>>        - consumed > 0, produced > 0. PMD has stored relevant state and history and so
> >>>>          can continue using op.produced and continuing from consumed+1.
> >>>>
> >>>> We would not return OUT_OF_SPACE_RECOVERABLE in stateless mode in our
> >>>> implementation either.
> >>>>
> >>>> Regardless of speculative future PMDs, the more important aspect of this
> >>>> for today is that the return status clearly determines
> >>>> the meaning of "consumed". If it is RECOVERABLE, then consumed is
> >>>> meaningful; if it is TERMINATED, then consumed is meaningless.
> >>>> This way we take away the ambiguity of having OUT_OF_SPACE mean two
> >>>> different user work flows.
> >>>>
> >>>> A speculative future PMD may be designed to return RECOVERABLE for
> >>>> stateless ops that are attached to streams.
> >>>> A future PMD may look to see if an op has a stream attached and write
> >>>> out the state there and go into recoverable mode.
> >>>> In essence, this leaves the choice up to the implementation and allows
> >>>> the PMD to take advantage of stateless optimizations
> >>>> so long as a "RECOVERABLE" scenario is rarely hit. The PMD will dump
> >>>> context as soon as it fully processes an op. It will only
> >>>> write context out in cases where the op chokes.
> >>>> This futuristic PMD should ignore the FLUSH flag since this is STATELESS mode as
> >>>> indicated by the user, and optimize.
> >>> [Shally] IMO, it looks okay to have two separate return codes, TERMINATED and RECOVERABLE, with the
> >>> definitions you mentioned, and it seems doable.
> >>> So then it means all of the following conditions:
> >>> a. stateless with flush = full/final, no stream pointer provided: PMD can return TERMINATED, i.e. the user
> >>> has to start all over again; it's a failure (as in the current definition).
> >>> b. stateless with flush = full/final, stream pointer provided: here it's up to the PMD to return either
> >>> TERMINATED or RECOVERABLE depending upon its ability (note: if RECOVERABLE, then the PMD will maintain
> >>> state in the stream pointer).
> >>> c. stateful with flush = full / NO_SYNC, stream pointer always there: PMD will return
> >>> TERMINATED/RECOVERABLE depending on whether the STATEFUL_COMPRESSION/DECOMPRESSION feature flag
> >>> is enabled or not.
> >> [Fiona] I don't think the flush flag is relevant: it could be out of space on any flush flag, and if out of space
> >> should ignore the flush flag.
> >> Is there a need for TERMINATED? I didn't think it would ever need to be returned in the stateful case.
> >>  Why the reference to the feature flag? If a PMD doesn't support a feature, I think it should fail the op, not with
> >>  out-of-space, but with unsupported or similar. Or it would fail on stream creation.
> >[Ahmed] Agreed with Fiona. The flush flag only matters on success. By
> >definition the PMD should return OUT_OF_SPACE_RECOVERABLE in stateful
> >mode when it runs out of space.
> >@Shally If the user did not provide a stream, then the PMD should
> >probably return TERMINATED every time. I am not sure we should make a
> >"really smart" PMD which returns RECOVERABLE even if no stream pointer
> >was given. In that case the PMD must give some ID back to the caller
> >that the caller can use to "recover" the op. I am not sure how it would
> >be implemented in the PMD and when does the PMD decide to retire streams
> >belonging to dead ops that the caller decided not to "recover".
> >>
> >>> and one more exceptional case is:
> >>> d. stateless with flush = full, no stream pointer provided: PMD can return RECOVERABLE, i.e. the PMD
> >>> internally maintained that state somehow and consumed & produced > 0, so the user can start from consumed+1,
> >>> but there's a restriction on the user not to alter or change the op until it is fully processed?!
> >> [Fiona] Why the need for this case?
> >> There's always a restriction on user not to alter or change op until it is fully processed.
> >> If a PMD can do this - why doesn't it create a stream when that API is called - and then it's same as b?
> >[Ahmed] Agreed. The user should not touch an op once enqueued until they
> >receive it in dequeue. We ignore the flush in stateless mode. We assume
> >it to be final every time.
> 
> [Shally] Agreed, and I am not in favour of supporting such an implementation either. I just listed the different
> possibilities here to better visualise Ahmed's requirements and the applicability of TERMINATED and
> RECOVERABLE.
> 
> >>
> >>> The API currently takes care of cases a and c, and case b can be supported if the specification accepts another
> >>> proposal which mentions optional usage of a stream with stateless.
> >> [Fiona] API has this, but as we agreed, it's not optional to call create_stream() with an op_type
> >> parameter (stateful/stateless). The PMD can return NULL or provide a stream; if the latter, then that
> >> stream must be attached to ops.
> >>
> >>> Until then, the API makes no distinction between
> >>> cases b and c, i.e. we can have an op such as:
> >>> - type = stateful with flush = full/final, stream pointer provided: PMD can return
> >>> TERMINATED/RECOVERABLE according to its ability.
> >>>
> >>> Case d is something exceptional; if there's a requirement in PMDs to support it, then I believe it will be
> >>> doable with the concept of different return codes.
> >>>
> >> [Fiona] That's not quite how I understood it. Can it be simpler, with only the following cases?
> >> a. stateless with flush = full/final, no stream pointer provided: PMD can return TERMINATED, i.e. the user
> >>     has to start all over again; it's a failure (as in the current definition).
> >>     consumed = 0, produced = amount of data produced. This is usually 0, but in the decompression
> >>     case a PMD may return > 0 and the application may find it useful to inspect that data.
> >> b. stateless with flush = full/final, stream pointer provided: here it's up to the PMD to return either
> >>     TERMINATED or RECOVERABLE depending upon its ability (note: if RECOVERABLE, then the PMD will maintain
> >>     state in the stream pointer).
> >> c. stateful with flush = any, stream pointer always there, PMD will return RECOVERABLE.
> >>     op.produced can be used and next op in stream should continue on from op.consumed+1.
> >>     Consumed=0, produced=0 is an unusual but allowed case. I'm not sure if it could ever happen, but
> >>     no need to change state to TERMINATED in this case. There may be useful state/history
> >>     stored in the PMD, even though no output produced yet.
> >[Ahmed] Agreed
> [Shally] Sounds good.
> 
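A minimal sketch of the rule agreed above (cases a, b and c): the returned status, not the op type, tells the application whether `consumed` is meaningful. The enum, the structs and `plan_next_op()` are hypothetical stand-ins written for illustration, not the `rte_comp` API, which was still under discussion in this thread; "continue from consumed+1" is expressed here as a 0-based offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the statuses discussed in this thread;
 * not the real rte_comp.h definitions. */
enum sketch_status {
    SKETCH_SUCCESS,
    SKETCH_OUT_OF_SPACE_TERMINATED,  /* stateless: nothing to resume */
    SKETCH_OUT_OF_SPACE_RECOVERABLE  /* stateful: resume after consumed */
};

struct sketch_op {
    enum sketch_status status;
    size_t consumed;  /* bytes of input the PMD read */
    size_t produced;  /* bytes written to the output buffer */
};

/* What the application should submit next. */
struct next_submit {
    int done;           /* 1 if no further op is needed */
    int grow_dst;       /* 1 if the same input must be resent with a bigger dst */
    size_t src_offset;  /* 0-based input offset for the next op */
    size_t dst_offset;  /* output offset the next op should write at */
};

static struct next_submit plan_next_op(const struct sketch_op *op,
                                       size_t src_offset, size_t dst_offset)
{
    struct next_submit n = { 0, 0, src_offset, dst_offset };

    switch (op->status) {
    case SKETCH_SUCCESS:
        n.done = 1;
        break;
    case SKETCH_OUT_OF_SPACE_TERMINATED:
        /* consumed is meaningless: resend the same input from the same
         * offsets into a bigger buffer. produced bytes may only be inspected. */
        n.grow_dst = 1;
        break;
    case SKETCH_OUT_OF_SPACE_RECOVERABLE:
        /* "continue from consumed+1" (1-based) is offset + consumed (0-based);
         * the produced output is kept and the next op appends after it. */
        n.src_offset = src_offset + op->consumed;
        n.dst_offset = dst_offset + op->produced;
        break;
    }
    return n;
}
```

With TERMINATED the application restarts from the op's original input offset, and any `produced` bytes are at most inspectable (the decompression/search case). With RECOVERABLE it keeps the output already produced and continues.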
> >>
> >>>>>>>>>>> D.2 Compression API Stateful operation
> >>>>>>>>>>> ----------------------------------------------------------
> >>>>>>>>>>> A Stateful operation in DPDK compression means the application invokes
> >>>>>>>>>>> enqueue_burst() multiple times to process related chunks of data, either
> >>>>>>>>>>> because:
> >>>>>>>>>>> - the application broke the data into several ops, and/or
> >>>>>>>>>>> - the PMD ran into an out_of_space situation during input processing.
> >>>>>>>>>>>
> >>>>>>>>>>> In case of either one or all of the above conditions, the PMD is required to
> >>>>>>>>>>> maintain the state of the op across enqueue_burst() calls, and
> >>>>>>>>>>> ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with
> >>>>>>>>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and ending with flush value
> >>>>>>>>>>> RTE_COMP_FULL/FINAL_FLUSH.
> >>>>>>>>>>> D.2.1 Stateful operation state maintenance
> >>>>>>>>>>> ---------------------------------------------------------------
> >>>>>>>>>>> Ideally, the application should parse
> >>>>>>>>>>> through all related chunks of source data, building its mbuf-chain, and enqueue
> >>>>>>>>>>> it for stateless processing.
> >>>>>>>>>>> However, if it needs to break the data into several enqueue_burst() calls, then an
> >>>>>>>>>>> expected call flow would be something like:
> >>>>>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue_burst()
> >>>>>>>>>> in a loop until all ops are received. Is this correct?
> >>>>>>>>>>
> >>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is specifically in
> >>>>>>>>> the context of stateful op processing, to show that if a stream is broken into chunks, then each chunk should be
> >>>>>>>>> submitted as one op at a time with type = STATEFUL and needs to be dequeued before the next chunk is
> >>>>>>>>> enqueued.
> >>>>>>>>>
> >>>>>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>>>>>>>> enqueue_burst( |op.full_flush |)
> >>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
> >>>>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish
> >>>>>>>>>> the response in exception cases?
> >>>>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in
> >>>>>>>>> such a case is independent of the others, i.e. they belong to different streams altogether.
> >>>>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks of data in a single
> >>>>>>>>> burst by passing them as an ops array, but later found that not so useful for PMD handling, for various
> >>>>>>>>> reasons. Please refer to the RFC v1 doc review comments for the same.
> >>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a time, since each needs the
> >>>>>>>> state of the previous, allowing more than 1 op to be in flight at a time would
> >>>>>>>> force PMDs to implement internal queueing and exception handling for the
> >>>>>>>> OUT_OF_SPACE conditions you mention.
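The one-op-in-flight rule Shally and Fiona describe can be sketched with a mock stream: the application splits its data into chunks, uses a no-flush value on every op except the last, and dequeues each op before building the next. `mock_stream`, `mock_enqueue()`, `mock_dequeue()` and the `SKETCH_FLUSH_*` values are hypothetical; the "compression" is a plain copy so that the sketch stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flush values: every op but the last uses NONE, the last FINAL. */
enum sketch_flush { SKETCH_FLUSH_NONE, SKETCH_FLUSH_FINAL };

/* Hypothetical mock of one stateful stream on a PMD that accepts at most one
 * in-flight op per stream. "Processing" is a plain copy into out[]. */
struct mock_stream {
    int in_flight;   /* 1 while an enqueued op has not been dequeued */
    int finished;    /* set once a FINAL op has been accepted */
    char out[256];   /* accumulated output for the whole stream */
    size_t out_len;
};

struct mock_op {
    const char *src;
    size_t len;
    enum sketch_flush flush;
};

/* Returns 1 if accepted, 0 if refused. */
static int mock_enqueue(struct mock_stream *s, const struct mock_op *op)
{
    if (s->in_flight || s->finished)
        return 0;  /* previous op not dequeued yet, or stream already closed */
    if (s->out_len + op->len > sizeof(s->out))
        return 0;  /* keep the mock simple: no OUT_OF_SPACE handling here */
    memcpy(s->out + s->out_len, op->src, op->len);
    s->out_len += op->len;
    s->in_flight = 1;
    if (op->flush == SKETCH_FLUSH_FINAL)
        s->finished = 1;
    return 1;
}

/* Returns 1 if an op was dequeued. */
static int mock_dequeue(struct mock_stream *s)
{
    if (!s->in_flight)
        return 0;
    s->in_flight = 0;
    return 1;
}

/* Application side: chunk the input and keep exactly one op in flight.
 * Returns the number of ops submitted, or -1 on refusal. */
static int submit_stateful(struct mock_stream *s, const char *data,
                           size_t len, size_t chunk)
{
    size_t off = 0;
    int ops = 0;

    assert(chunk > 0);
    while (off < len) {
        struct mock_op op;
        op.src = data + off;
        op.len = (len - off < chunk) ? len - off : chunk;
        op.flush = (off + op.len == len) ? SKETCH_FLUSH_FINAL : SKETCH_FLUSH_NONE;
        if (!mock_enqueue(s, &op))
            return -1;
        while (!mock_dequeue(s))
            ;  /* poll: the next chunk is only built after this op returns */
        off += op.len;
        ops++;
    }
    return ops;
}
```

The mock refuses a second op from the same stream while one is still in flight, which is the behaviour a PMD without multi-op support would expect the application to respect.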
> >>>>>> [Ahmed] But we are putting the ops on qps which would make them
> >>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
> >>>>>> complex but doable.
> >>>>> [Fiona] In my opinion this is not doable; it could be very inefficient.
> >>>>> There may be many streams.
> >>>>> The PMD would have to have an internal queue per stream so
> >>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
> >>>>> And this may ripple back through all subsequent ops in the stream as each
> >>>>> source len is increased and its dst buffer is not big enough.
> >>>> [Ahmed] Regarding multi-op OUT_OF_SPACE handling:
> >>>> The caller would still need to adjust the src length/output buffer as you say.
> >>>> The PMD cannot handle OUT_OF_SPACE internally.
> >>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
> >>>> until it gets explicit confirmation from the caller to continue working on this stream.
> >>>> Any ops received by the PMD should be returned to the caller with status STREAM_PAUSED,
> >>>> since the caller did not explicitly acknowledge that it has solved the OUT_OF_SPACE issue.
> >>>> These semantics can be enabled by adding a new function to the API, perhaps stream_resume().
> >>>> This allows the caller to indicate that it acknowledges that it has seen the issue
> >>>> and that this op should be used to resolve the issue. Implementations that do not support
> >>>> this mode of use can push back immediately after one op is in flight. Implementations
> >>>> that support this mode of use can allow many ops from the same session.
> >>>>
> >>> [Shally] Is this still in the context of having a single burst where all ops belong to one stream? If yes, I would still
> >>> say it would add overhead to PMDs, especially if they are expected to work close to HW (which I think is
> >>> the case with DPDK PMDs).
> >>> Your approach is doable, but why can't all this be in a layer above the PMD? i.e. a layer above the PMD
> >>> can either pass one op at a time with burst size = 1, OR can make a chained mbuf of input and output and
> >>> pass that as one op.
> >>> Is it just to ease the chained-mbuf burden on applications, or do you see any performance/use-case
> >>> impacting aspect as well?
> >>>
> >>> If it is in the context where each op in a burst belongs to a different stream, then why do we need
> >>> stream_pause and resume? The app is expected to pass a bigger output buffer with consumed + 1
> >>> from the next call onwards, as it has already
> >>> seen OUT_OF_SPACE.
> >[Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
> >implementation rejects all ops that belong to a stream that has entered
> >"RECOVERABLE" state for one reason or another. The caller must
> >acknowledge explicitly that it has received news of the problem before
> >the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
> >that implementing this functionality in the software layer above the PMD
> >is a bad idea since the latency reductions are lost.
> 
> [Shally] Just to reiterate, I meant the other way around, i.e. I see it as easier to put all such complexity in a
> layer above the PMD.
> 
> >This setup is useful in latency sensitive applications where the latency
> >of buffering multiple ops into one op is significant. We found latency
> >makes a significant difference in search applications where the PMD
> >competes with software decompression.
[Fiona] I see, so when all goes well, you get best-case latency, but when 
out-of-space occurs latency will probably be worse.

> >> [Fiona] I still have concerns with this and would not want to support in our PMD.
> >> To make sure I understand: you want to send a burst of ops, with several from the same stream.
> >> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
> >> subsequent ops in that stream.
> >> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
> >> Or somehow drop them? How?
> >> While still processing ops from other streams.
> >[Ahmed] This is exactly correct. It should return them with
> >NOT_PROCESSED. Yes, the PMD should continue processing other streams.
> >> As we want to offload each op to hardware with as little CPU processing as possible we
> >> would not want to open up each op to see which stream it's attached to and
> >> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.
> >[Ahmed] I think I might have missed your point here, but I will try to
> >answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
> >to the PMD and the PMD should reject until stream_continue() is called.
> >The next op to be sent by the user will have a special marker in it to
> >inform the PMD to continue working on this stream. Alternatively the
> >DPDK layer can be made "smarter" to fail during the enqueue by checking
> >the stream and its state, but like you say this adds additional CPU
> >overhead during the enqueue.
> >I am curious. In a simple synchronous use case. How do we prevent users
> >from putting multiple ops in flight that belong to a single stream? Do
> >we just currently say it is undefined behavior? Otherwise we would have
> >to check the stream and incur the CPU overhead.
[Fiona] We don't do anything to prevent it. It's undefined. IMO, on the data path in the
DPDK model we expect good behaviour and don't have to error-check for things like this.

In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then 
build and send those messages. If we found an op from a stream which already
had one inflight, we'd have to hold that back, store in a sw stream-specific holding queue,
only send 19 to hw. We cannot send multiple ops from same stream to
the hw as it fans them out and does them in parallel.
Once the enqueue_burst() returns, there is no processing 
context which would spot that the first has completed
and send the next op to the hw. On a dequeue_burst() we would spot this, 
in that context could process the next op in the stream.
On out of space, instead of processing the next op we would have to transfer
all unprocessed ops from the stream to the dequeue result.
Some parts of this are doable, but seems likely to add a lot more latency, 
we'd need to add extra threads and timers to move ops from the sw
queue to the hw q to get any benefit, and these constructs would add 
context switching and CPU cycles. So we prefer to push this responsibility
to above the API and it can achieve similar.
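To make the cost Fiona describes concrete, here is a hypothetical sketch of the split a PMD would have to make at enqueue time if several ops from one stateful stream could arrive in a burst: the first op of an idle stream goes to hardware, and every other op waits in a software holding queue. It covers only the enqueue-side decision, not the dequeue-time draining and out-of-space transfer she also mentions; `split_burst()` and `MAX_STREAMS` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_STREAMS 16

/* Hypothetical enqueue-side split: only the first op of a stream with nothing
 * already in flight may go to hardware; the others go to a software holding
 * queue until an earlier op of the same stream has been dequeued.
 * Returns the number of ops sent to hardware; the rest are listed in held[]. */
static int split_burst(const int *stream_id, int n, bool *in_flight,
                       int *to_hw, int *held)
{
    int n_hw = 0, n_held = 0;

    for (int i = 0; i < n; i++) {
        int sid = stream_id[i];
        assert(sid >= 0 && sid < MAX_STREAMS);
        if (in_flight[sid]) {
            held[n_held++] = i;   /* must wait for the earlier op's result */
        } else {
            in_flight[sid] = true;
            to_hw[n_hw++] = i;
        }
    }
    return n_hw;
}
```

Even this simple pass has to look up each op's stream on the enqueue path, which is the per-op inspection Fiona wants to avoid.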

 

> >>
> >> Maybe we could add a capability if this behaviour is important for you?
> >> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
> >> Our PMD would set this to 0. And expect no more than one op from a stateful stream
> >> to be in flight at any time.
> >[Ahmed] That makes sense. This way the different DPDK implementations do
> >not have to add extra checking for unsupported cases.
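If such a capability were added, the application-side check could be as small as the sketch below. The flag macro and its bit position are hypothetical, named after the ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS proposal above. A PMD that leaves it at 0 gets at most one in-flight op per stateful stream.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature-flag bit named after the capability proposed above;
 * the bit position is arbitrary and not part of any DPDK header. */
#define SKETCH_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS (UINT64_C(1) << 0)

/* How many ops of one stateful stream the application keeps in flight. */
static unsigned int max_inflight_per_stream(uint64_t feature_flags,
                                            unsigned int wanted)
{
    if (feature_flags & SKETCH_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS)
        return wanted;  /* PMD accepts several; it handles ordering itself */
    return 1;           /* default: enqueue one op, dequeue it, then the next */
}
```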
> 
> [Shally] @ahmed, if I summarise your use-case, is this how you want the PMD to support it?
> - a burst *carries only one stream* and all ops are then assumed to belong to that stream? (please note,
>   here a burst does not carry more than one stream)
> - the PMD will submit one op at a time to HW?
> - if processed successfully, push it back to the completion queue with status = SUCCESS. If it failed or ran
>   into OUT_OF_SPACE, then push it to the completion queue with status = FAILURE/
>   OUT_OF_SPACE_RECOVERABLE and the rest with status = NOT_PROCESSED, and return with enqueue count
>   = total # of ops originally submitted with the burst?
> - the app assumes all have been enqueued, so it goes and dequeues all ops
> - on seeing an op with OUT_OF_SPACE_RECOVERABLE, the app resubmits a burst of ops with a call to the
>   stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and the others marked
>   NOT_PROCESSED, with updated input and output buffers?
> - repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If failure
>   is seen at any time, does the app start the whole processing all over again or just drop this burst?!
> 
> If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream()
> which will be functional under Feature Flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS, or a better
> name, SUPPORT_ENQUEUE_SINGLE_STREAM?!
[Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one stream,
or how this makes a difference, as there can be many enqueue_burst() calls done before a dequeue_burst().
Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all the ops
had been processed? This would turn it into a synchronous call, which isn't the intent.

> 
> 
> >>
> >>
> >>>> Regarding the ordering of ops
> >>>> We do force serialization of ops belonging to a stream in STATEFUL
> >>>> operation. Related ops do
> >>>> not go out of order and are given to available PMDs one at a time.
> >>>>
> >>>>>> The question is: is this mode of use useful for real-life
> >>>>>> applications, or would we just be adding complexity? The technical
> >>>>>> advantage of this is that processing of Stateful ops is interdependent
> >>>>>> and PMDs can take advantage of caching and other optimizations to make
> >>>>>> processing related ops much faster than switching on every op. PMDs have to
> >>>>>> maintain state of more than 32 KB for DEFLATE for every stream.
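For scale: DEFLATE back-references reach at most 32 KiB (RFC 1951), so the history window alone sets a floor on per-stream state. A sketch of that lower bound; `min_history_bytes()` is an illustrative helper, and the real per-stream state is larger, as the message says. For example, 1024 concurrent streams already need at least 32 MiB of history.

```c
#include <assert.h>
#include <stddef.h>

/* Largest DEFLATE back-reference distance (RFC 1951): 32 KiB of history. */
#define DEFLATE_WINDOW_BYTES ((size_t)32 * 1024)

/* Lower bound on history a PMD must keep for n concurrent stateful streams.
 * Real per-stream state is larger (the message says "more than 32 KB"). */
static size_t min_history_bytes(size_t n_streams)
{
    return n_streams * DEFLATE_WINDOW_BYTES;
}
```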
> >>>>>>>> If the application has all the data, it can put it into chained mbufs in a single
> >>>>>>>> op rather than
> >>>>>>>> multiple ops, which avoids pushing all that complexity down to the PMDs.
> >>>>>> [Ahmed] I think that your suggested scheme of putting all related mbufs
> >>>>>> into one op may be the best solution without the extra complexity of
> >>>>>> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
> >>>>>> time, if we have a way of marking mbufs as ready for consumption. The
> >>>>>> enqueuer may not have all the data at hand but can enqueue the op with a
> >>>>>> couple of empty mbufs marked as not ready for consumption. The enqueuer
> >>>>>> will then update the rest of the mbufs to ready for consumption once the
> >>>>>> data is added. This introduces a race condition. A second flag for each
> >>>>>> mbuf can be updated by the PMD to indicate whether it processed it or not.
> >>>>>> This way, in cases where the PMD beat the application to the op, the
> >>>>>> application will just update the op to point to the first unprocessed
> >>>>>> mbuf and resend it to the PMD.
> >>>>> [Fiona] This doesn't sound safe. You want to add data to a stream after you've
> >>>>> enqueued the op. You would have to write to op.src.length at a time when the PMD
> >>>>> might be reading it. Sounds like a lock would be necessary.
> >>>>> Once the op has been enqueued, my understanding is its ownership is handed
> >>>>> over to the PMD and the application should not touch it until it has been dequeued.
> >>>>> I don't think it's a good idea to change this model.
> >>>>> Can't the application just collect a stream of data in chained mbufs until it has
> >>>>> enough to send an op, then construct the op and while waiting for that op to
> >>>>> complete, accumulate the next batch of chained mbufs? Only construct the next op
> >>>>> after the previous one is complete, based on the result of the previous one.
> >>>>>
> >>>> [Ahmed] Fair enough. I agree with you. I imagined it in a different way
> >>>> in which each mbuf would have its own length.
> >>>> The advantage to gain is in applications where there is one PMD user:
> >>>> the downtime between ops can be significant, and setting up a single
> >>>> producer-consumer pair significantly reduces the CPU cycles and PMD
> >>>> downtime.
> >>>>
> >>>> ////snip////
> >


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-16 13:04                       ` Trahe, Fiona
@ 2018-02-16 21:21                         ` Ahmed Mansour
  2018-02-20  9:58                           ` Verma, Shally
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed Mansour @ 2018-02-16 21:21 UTC (permalink / raw)
  To: Trahe, Fiona, Verma, Shally, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

>> -----Original Message-----
>> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
>> Sent: Friday, February 16, 2018 7:17 AM
>> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
>> dev@dpdk.org
>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>> Subject: RE: [RFC v2] doc compression API for DPDK
>>
>> Hi Fiona, Ahmed
>>
>>> -----Original Message-----
>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>>> Sent: 16 February 2018 02:40
>>> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>> <Ashish.Gupta@cavium.com>; Sahu, Sunila
>>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa,
>> Mahipal
>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal
>> <hemant.agrawal@nxp.com>; Roy
>>> Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>
>>> On 2/15/2018 1:47 PM, Trahe, Fiona wrote:
>>>> Hi Shally, Ahmed,
>>>> Sorry for the delay in replying,
>>>> Comments below
>>>>
>>>>> -----Original Message-----
>>>>> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
>>>>> Sent: Wednesday, February 14, 2018 7:41 AM
>>>>> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
>>>>> dev@dpdk.org
>>>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>>>>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>>>>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
>>>>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>>>>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>>>> Subject: RE: [RFC v2] doc compression API for DPDK
>>>>>
>>>>> Hi Ahmed,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>>>>>> Sent: 02 February 2018 01:53
>>>>>> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>;
>> dev@dpdk.org
>>>>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>>>>> <Ashish.Gupta@cavium.com>; Sahu, Sunila
>>>>>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa,
>>>>> Mahipal
>>>>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal
>>>>> <hemant.agrawal@nxp.com>; Roy
>>>>>> Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>>>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>>>>
>>>>>> On 1/31/2018 2:03 PM, Trahe, Fiona wrote:
>>>>>>> Hi Ahmed, Shally,
>>>>>>>
>>>>>>> ///snip///
>>>>>>>>>>>>> D.1.1 Stateless and OUT_OF_SPACE
>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>> OUT_OF_SPACE is a condition where the output buffer runs out of space
>>>>>>>>>>>>> while the PMD still has more data to produce. If the PMD runs into such a
>>>>>>>>>>>>> condition, then it's an error condition in stateless processing.
>>>>>>>>>>>>> In such a case, the PMD resets itself and returns with status
>>>>>>>>>>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0,
>>>>>>>>>>>>> i.e. no input read, no output written.
>>>>>>>>>>>>> The application can resubmit the full input with a larger output buffer size.
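Under these stateless semantics the application's only recovery is to resubmit the whole input with a larger output buffer. A sketch of a doubling retry policy, assuming a hypothetical mock op that reports `consumed = 0` on OUT_OF_SPACE; the doubling factor and the cap are illustrative choices, not something the RFC specifies.

```c
#include <assert.h>
#include <stddef.h>

enum sl_status { SL_SUCCESS, SL_OUT_OF_SPACE };

struct sl_result {
    enum sl_status status;
    size_t consumed;
    size_t produced;
};

/* Hypothetical stand-in for a stateless op: output_size is what the full
 * input would expand to. On OUT_OF_SPACE nothing is resumable: consumed = 0. */
static struct sl_result mock_stateless(size_t src_len, size_t output_size,
                                       size_t dst_cap)
{
    struct sl_result r = { SL_SUCCESS, src_len, output_size };
    if (dst_cap < output_size) {
        r.status = SL_OUT_OF_SPACE;
        r.consumed = 0;
        r.produced = 0;  /* a PMD might instead report a partial count here */
    }
    return r;
}

/* Application retry policy: resubmit the *whole* input, doubling the output
 * buffer each time, up to max_cap. Returns the buffer size that worked, or 0
 * if the cap was reached. *attempts receives the number of submissions. */
static size_t stateless_with_retry(size_t src_len, size_t output_size,
                                   size_t dst_cap, size_t max_cap, int *attempts)
{
    *attempts = 0;
    assert(dst_cap > 0);
    while (dst_cap <= max_cap) {
        struct sl_result r = mock_stateless(src_len, output_size, dst_cap);
        (*attempts)++;
        if (r.status == SL_SUCCESS)
            return dst_cap;
        if (dst_cap > max_cap / 2)
            break;  /* doubling would exceed the cap */
        dst_cap *= 2;  /* always restart from input offset 0 */
    }
    return 0;
}
```

Every retry repeats the work from the start, which is why the thread recommends a stateful op when the maximum output size is not known in advance.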
>>>>>>>>>>>> [Ahmed] Can we add an option to allow the user to read the data that was
>>>>>>>>>>>> produced while still reporting OUT_OF_SPACE? This is mainly useful for
>>>>>>>>>>>> decompression applications doing search.
>>>>>>>>>>> [Shally] It is there, but applicable to the stateful operation type (please refer to handling
>>>>>>>>>>> out_of_space under the "Stateful Section").
>>>>>>>>>>> By definition, "stateless" here means that the application (such as IPCOMP) knows the maximum
>>>>>>>>>>> output size with certainty and ensures that the uncompressed data size cannot grow beyond the
>>>>>>>>>>> provided output buffer.
>>>>>>>>>>> Such apps can submit an op with type = STATELESS and provide the full input; the PMD then assumes
>>>>>>>>>>> it has sufficient input and output and thus doesn't need to maintain any context after the op is processed.
>>>>>>>>>>> If the application doesn't know the max output size, then it should process it as a stateful op, i.e. set
>>>>>>>>>>> up the op with type = STATEFUL and attach a stream so that the PMD can maintain the relevant context
>>>>>>>>>>> to handle such a condition.
>>>>>>>>>> [Fiona] There may be an alternative that's useful for Ahmed, while still
>>>>>>>>>> respecting the stateless concept.
>>>>>>>>>> In the stateless case, where a PMD reports OUT_OF_SPACE in the decompression case,
>>>>>>>>>> it could also return consumed=0, produced=x, where x>0. x indicates the
>>>>>>>>>> amount of valid data which has been written to the output buffer. It is not
>>>>>>>>>> complete, but if an application wants to search it, it may be sufficient.
>>>>>>>>>> If the application still wants the data, it must resubmit the whole input with a
>>>>>>>>>> bigger output buffer, and decompression will be repeated from the start; it
>>>>>>>>>> cannot expect to continue on, as the PMD has not maintained state, history
>>>>>>>>>> or data.
>>>>>>>>>> I don't think there would be any need to indicate this in capabilities; PMDs
>>>>>>>>>> which cannot provide this functionality would always return produced=consumed=0,
>>>>>>>>>> while PMDs which can could set produced > 0.
>>>>>>>>>> If this works for you both, we could consider a similar case for compression.
>>>>>>>>>>
>>>>>>>>> [Shally] Sounds fine to me. Though in that case, consumed should also be updated to the actual
>>>>>>>>> amount consumed by the PMD.
>>>>>>>>> Setting consumed = 0 with produced > 0 doesn't correlate.
>>>>>>>> [Ahmed] I like Fiona's suggestion, but I also do not like the implication
>>>>>>>> of returning consumed = 0. At the same time returning consumed = y
>>>>>>>> implies to the user that it can proceed from the middle. I prefer the
>>>>>>>> consumed = 0 implementation, but I think a different return is needed to
>>>>>>>> distinguish it from OUT_OF_SPACE that the user can recover from. Perhaps
>>>>>>>> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
>>>>>>>> future PMD implementations to provide recover-ability even in STATELESS
>>>>>>>> mode if they so wish. In this model STATELESS or STATEFUL would be a
>>>>>>>> hint for the PMD implementation to make optimizations for each case, but
>>>>>>>> it does not force the PMD implementation to limit functionality if it
>>>>>>>> can provide recover-ability.
>>>>>>> [Fiona] So you're suggesting the following:
>>>>>>> OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
>>>>>>>     can be used and next op in stream should continue on from op.consumed+1.
>>>>>>> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
>>>>>>>     Error condition, no recovery possible.
>>>>>>>     consumed=produced=0. Application must resubmit all input data with
>>>>>>>     a bigger output buffer.
>>>>>>> OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
>>>>>>>      - consumed = 0, produced > 0. Application must resubmit all input data with
>>>>>>>         a bigger output buffer. However in decompression case, data up to produced
>>>>>>>         in dst buffer may be inspected/searched. Never happens in compression
>>>>>>>         case as output data would be meaningless.
>>>>>>>      - consumed > 0, produced > 0. PMD has stored relevant state and history and so
>>>>>>>         can convert to stateful, using op.produced and continuing from consumed+1.
>>>>>>> I don't expect our PMDs to use this last case, but maybe this works for others?
>>>>>>> I'm not convinced it's not just adding complexity. It sounds like a version of stateful
>>>>>>> without a stream, and maybe less efficient?
>>>>>>> If so should it respect the FLUSH flag? Which would have been FULL or FINAL in the op.
>>>>>>> Or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
>>>>>>> simply have submitted a STATEFUL request if this is the behaviour it wants?
>>>>>> [Ahmed] I was actually suggesting the removal of OUT_OF_SPACE entirely
>>>>>> and replacing it with
>>>>>> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
>>>>>>        Error condition, no recovery possible.
>>>>>>        - consumed=0, produced=amount of data produced. Application must
>>>>>>          resubmit all input data with a bigger output buffer to process
>>>>>>          all of the op.
>>>>>> OUT_OF_SPACE_RECOVERABLE - Normally returned on stateful operation. Not
>>>>>>        an error. Op.produced can be used and next op in stream should
>>>>>>        continue on from op.consumed+1.
>>>>>>        - consumed > 0, produced > 0. PMD has stored relevant state and
>>>>>>          history and so can continue using op.produced and continuing
>>>>>>          from consumed+1.
>>>>>>
>>>>>> We would not return OUT_OF_SPACE_RECOVERABLE in stateless mode in our
>>>>>> implementation either.
>>>>>>
>>>>>> Regardless of speculative future PMDs, the more important aspect of this
>>>>>> for today is that the return status clearly determines
>>>>>> the meaning of "consumed". If it is RECOVERABLE then consumed is
>>>>>> meaningful. If it is TERMINATED then consumed is meaningless.
>>>>>> This way we take away the ambiguity of having OUT_OF_SPACE mean two
>>>>>> different user work flows.
>>>>>>
>>>>>> A speculative future PMD may be designed to return RECOVERABLE for
>>>>>> stateless ops that are attached to streams.
>>>>>> A future PMD may look to see if an op has a stream attached and write
>>>>>> out the state there and go into recoverable mode.
>>>>>> In essence this leaves the choice up to the implementation and allows
>>>>>> the PMD to take advantage of stateless optimizations
>>>>>> so long as a "RECOVERABLE" scenario is rarely hit. The PMD will dump
>>>>>> context as soon as it fully processes an op. It will only
>>>>>> write context out in cases where the op chokes.
>>>>>> This futuristic PMD should ignore the FLUSH since this is STATELESS mode as
>>>>>> indicated by the user and optimize
>>>>> [Shally] IMO, it looks okay to have two separate return codes, TERMINATED and RECOVERABLE, with
>>>>> definitions as you mentioned, and it seems doable.
>>>>> So then it means all of the following conditions:
>>>>> a. stateless with flush = full/final, no stream pointer provided, PMD can return TERMINATED i.e. user
>>>>> has to start all over again, it's a failure (as in current definition)
>>>>> b. stateless with flush = full/final, stream pointer provided, here it's up to PMD to return either
>>>>> TERMINATED or RECOVERABLE depending upon its ability (note if Recoverable, then PMD will maintain
>>>>> states in stream pointer)
>>>>> c. stateful with flush = full / NO_SYNC, stream pointer always there, PMD will return
>>>>> TERMINATED/RECOVERABLE depending on whether the STATEFUL_COMPRESSION/DECOMPRESSION feature flag
>>>>> is enabled or not
>>>> [Fiona] I don't think the flush flag is relevant - it could be out of space on any flush flag, and if out of space
>>>> should ignore the flush flag.
>>>> Is there a need for TERMINATED? - I didn't think it would ever need to be returned in stateful case.
>>>>  Why the ref to feature flag? If a PMD doesn't support a feature I think it should fail the op - not with
>>>>  out-of-space, but unsupported or similar. Or it would fail on stream creation.
>>> [Ahmed] Agreed with Fiona. The flush flag only matters on success. By
>>> definition the PMD should return OUT_OF_SPACE_RECOVERABLE in stateful
>>> mode when it runs out of space.
>>> @Shally If the user did not provide a stream, then the PMD should
>>> probably return TERMINATED every time. I am not sure we should make a
>>> "really smart" PMD which returns RECOVERABLE even if no stream pointer
>>> was given. In that case the PMD must give some ID back to the caller
>>> that the caller can use to "recover" the op. I am not sure how it would
>>> be implemented in the PMD and when does the PMD decide to retire streams
>>> belonging to dead ops that the caller decided not to "recover".
>>>>> and one more exception case is:
>>>>> d. stateless with flush = full, no stream pointer provided, PMD can return RECOVERABLE i.e. PMD
>>>>> internally maintained that state somehow and consumed & produced > 0, so user can start from consumed+1
>>>>> but there's a restriction on user not to alter or change op until it is fully processed?!
>>>> [Fiona] Why the need for this case?
>>>> There's always a restriction on user not to alter or change op until it is fully processed.
>>>> If a PMD can do this - why doesn't it create a stream when that API is called - and then it's same as b?
>>> [Ahmed] Agreed. The user should not touch an op once enqueued until they
>>> receive it in dequeue. We ignore the flush in stateless mode. We assume
>>> it to be final every time.
>> [Shally] Agreed, and I am not in favour of supporting such an implementation either. I just listed the different
>> possibilities here to better visualise Ahmed's requirements and the applicability of TERMINATED and
>> RECOVERABLE.
>>
>>>>> API currently takes care of cases a and c, and case b can be supported if the specification accepts another
>>>>> proposal which mentions optional usage of stream with stateless.
>>>> [Fiona] API has this, but as we agreed, not optional to call the create_stream() with an op_type
>>>> parameter (stateful/stateless). PMD can return NULL or provide a stream, if the latter then that
>>>> stream must be attached to ops.
>>>>
>>>>> Until then, the API treats cases b and c the same, i.e. we can have an op such as:
>>>>> - type = stateful with flush = full/final, stream pointer provided, PMD can return
>>>>> TERMINATED/RECOVERABLE according to its ability
>>>>>
>>>>> Case d is something exceptional; if there's a requirement in PMDs to support it, then I believe it will be
>>>>> doable with the concept of different return codes.
>>>>>
>>>> [Fiona] That's not quite how I understood it. Can it be simpler and only following cases?
>>>> a. stateless with flush = full/final, no stream pointer provided , PMD can return TERMINATED i.e. user
>>>>     has to start all over again, it's a failure (as in current definition).
>>>>     consumed = 0, produced=amount of data produced. This is usually 0, but in decompression
>>>>     case a PMD may return > 0 and application may find it useful to inspect that data.
>>>> b. stateless with flush = full/final, stream pointer provided, here it's up to PMD to return either
>>>>     TERMINATED or RECOVERABLE depending upon its ability (note if Recoverable, then PMD will maintain
>>>>     states in stream pointer)
>>>> c. stateful with flush = any, stream pointer always there, PMD will return RECOVERABLE.
>>>>     op.produced can be used and next op in stream should continue on from op.consumed+1.
>>>>     Consumed=0, produced=0 is an unusual but allowed case. I'm not sure if it could ever happen, but
>>>>     no need to change state to TERMINATED in this case. There may be useful state/history
>>>>     stored in the PMD, even though no output produced yet.
>>> [Ahmed] Agreed
>> [Shally] Sounds good.
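A minimal C sketch of how an application might act on the agreed cases a/b/c once an op has been dequeued. The enum and struct names are hypothetical, modelled on the status names used in this thread rather than taken from a released rte_comp.h. It assumes byte offsets counted from 0, so "continue from consumed+1" becomes the offset src_offset + consumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes named after this thread's proposal. */
enum comp_status {
    COMP_SUCCESS,
    COMP_OUT_OF_SPACE_TERMINATED,  /* cases a and b: consumed is meaningless */
    COMP_OUT_OF_SPACE_RECOVERABLE, /* cases b and c: consumed is meaningful */
};

struct next_step {
    int restart;          /* 1: resubmit all input with a bigger dst buffer */
    size_t resume_offset; /* 0-based offset of the next input byte to submit */
    size_t output_bytes;  /* bytes at the start of dst the app may look at */
};

/* src_offset: offset of this op's first input byte within the whole input.
 * consumed/produced: as reported in the dequeued op. */
static struct next_step next_step_for(enum comp_status st, size_t src_offset,
                                      size_t consumed, size_t produced)
{
    assert(consumed <= SIZE_MAX - src_offset);

    /* SUCCESS and RECOVERABLE: continue right after the consumed bytes.
     * consumed == 0 on RECOVERABLE is unusual but allowed (case c). */
    struct next_step s = { 0, src_offset + consumed, produced };

    if (st == COMP_OUT_OF_SPACE_TERMINATED) {
        /* Start over. produced may still be > 0 for decompression and
         * can be inspected, but it is not a resumable partial result. */
        s.restart = 1;
        s.resume_offset = 0;
    }
    return s;
}
```

The point of the split is that the status alone tells the application whether `consumed` may be used, which is the ambiguity Ahmed wanted removed.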
>>
>>>>>>>>>>>>> D.2 Compression API Stateful operation
>>>>>>>>>>>>> ----------------------------------------------------------
>>>>>>>>>>>>>  A Stateful operation in DPDK compression means application invokes
>>>>>>>>>>>>> enqueue_burst() multiple times to process related chunks of data either
>>>>>>>>>>>>> because
>>>>>>>>>>>>> - Application broke data into several ops, and/or
>>>>>>>>>>>>> - PMD ran into out_of_space situation during input processing
>>>>>>>>>>>>>
>>>>>>>>>>>>> In case of either one or all of the above conditions, PMD is required to
>>>>>>>>>>>>> maintain state of op across enqueue_burst() calls, and
>>>>>>>>>>>>> ops are set up with op_type RTE_COMP_OP_STATEFUL, begin with
>>>>>>>>>>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and end at flush value
>>>>>>>>>>>>> RTE_COMP_FULL/FINAL_FLUSH.
>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>>> It is always an ideal expectation from application that it should parse
>>>>>>>>>>>>> through all related chunks of source data making its mbuf-chain and
>>>>>>>>>>>>> enqueue it for stateless processing.
>>>>>>>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then
>>>>>>>>>>>>> an expected call flow would be something like:
>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
>>>>>>>>>>>> burst in a loop until all ops are received. Is this correct?
>>>>>>>>>>>>
>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However
>>>>>>>>>>> this illustration is specifically in the context of stateful op processing,
>>>>>>>>>>> to reflect that if a stream is broken into chunks, then each chunk should be
>>>>>>>>>>> submitted as one op at a time with type = STATEFUL and needs to be
>>>>>>>>>>> dequeued first before the next chunk is enqueued.
>>>>>>>>>>>
>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
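The stateful call flow above (one op per chunk, each dequeued before the next chunk is enqueued, final flush on the last op) can be sketched in C roughly as follows. This is a sketch under stated assumptions: mock_enqueue_dequeue() is a hypothetical pass-through stand-in for enqueue_burst()/dequeue_burst() on a single op, and the types are simplified placeholders, not the DPDK structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types: not the rte_comp structures. */
enum flush { FLUSH_NONE, FLUSH_FINAL };
enum status { ST_SUCCESS, ST_OUT_OF_SPACE_RECOVERABLE };

struct op {
    const unsigned char *src;
    size_t src_len;
    unsigned char *dst;
    size_t dst_len;
    enum flush flush;
    size_t consumed, produced;
    enum status status;
};

/* Hypothetical stand-in for enqueue_burst() of one op followed by
 * dequeue_burst(): a pass-through "codec" that copies bytes, so that
 * output space can run out before all input is consumed. */
static void mock_enqueue_dequeue(struct op *op)
{
    size_t n = op->src_len < op->dst_len ? op->src_len : op->dst_len;

    memcpy(op->dst, op->src, n);
    op->consumed = n;
    op->produced = n;
    op->status = n < op->src_len ? ST_OUT_OF_SPACE_RECOVERABLE : ST_SUCCESS;
}

/* Feed in[0..in_len) to one stream in chunks of at most `chunk` bytes,
 * keeping a single op in flight; each op gets a dst of dst_cap bytes.
 * Output is appended to out; returns the number of bytes written. */
static size_t run_stream(const unsigned char *in, size_t in_len, size_t chunk,
                         size_t dst_cap, unsigned char *out)
{
    unsigned char dst[64];
    size_t off = 0, out_len = 0;

    assert(chunk > 0 && dst_cap > 0 && dst_cap <= sizeof(dst));
    while (off < in_len) {
        size_t len = in_len - off < chunk ? in_len - off : chunk;
        struct op op = {
            .src = in + off, .src_len = len,
            .dst = dst, .dst_len = dst_cap,
            /* the op that reaches the end of the input carries the final flush */
            .flush = (off + len == in_len) ? FLUSH_FINAL : FLUSH_NONE,
        };

        mock_enqueue_dequeue(&op); /* dequeued before the next enqueue */
        memcpy(out + out_len, dst, op.produced);
        out_len += op.produced;
        /* SUCCESS or OUT_OF_SPACE_RECOVERABLE: the next op starts right
         * after the bytes this one consumed ("consumed+1" counting from 1). */
        off += op.consumed;
    }
    return out_len;
}
```

With a chunk size of 8 and a 5-byte output buffer, every op runs out of space, yet the stream still completes because each follow-up op starts at the reported `consumed` offset.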
>>>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
>>>>>>>>>>>> occasionally there will be OUT_OF_SPACE exception. Can we just distinguish
>>>>>>>>>>>> the response in exception cases?
>>>>>>>>>>> [Shally] Multiple ops are allowed in flight, however the condition is that each op in
>>>>>>>>>>> such case is independent of the others, i.e. belongs to a different stream altogether.
>>>>>>>>>>> Earlier (as part of RFC v1 doc) we did consider the proposal to process all
>>>>>>>>>>> related chunks of data in a single burst by passing them as an ops array, but later found
>>>>>>>>>>> that not-so-useful for PMD handling for various reasons.
>>>>>>>>>>> You may please refer to RFC v1 doc review comments for same.
>>>>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
>>>>>>>>>> time, since each needs the state of the previous, to allow more than 1 op to be
>>>>>>>>>> in-flight at a time would force PMDs to implement internal queueing and
>>>>>>>>>> exception handling for OUT_OF_SPACE conditions you mention.
>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>>>>>>> complex but doable.
>>>>>>> [Fiona] In my opinion this is not doable, could be very inefficient.
>>>>>>> There may be many streams.
>>>>>>> The PMD would have to have an internal queue per stream so
>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>>>>>>> And this may ripple back through all subsequent ops in the stream as each
>>>>>>> source len is increased and its dst buffer is not big enough.
>>>>>> [Ahmed] Regarding multi op OUT_OF_SPACE handling.
>>>>>> The caller would still need to adjust the src length/output buffer as you say.
>>>>>> The PMD cannot handle OUT_OF_SPACE internally.
>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
>>>>>> until it gets explicit confirmation from the caller to continue working on this stream.
>>>>>> Any ops received by the PMD should be returned to the caller with status STREAM_PAUSED
>>>>>> since the caller did not explicitly acknowledge that it has solved the OUT_OF_SPACE issue.
>>>>>> These semantics can be enabled by adding a new function to the API,
>>>>>> perhaps stream_resume().
>>>>>> This allows the caller to indicate that it acknowledges that it has seen
>>>>>> the issue and this op should be used to resolve the issue.
>>>>>> Implementations that do not support this mode of use can push back
>>>>>> immediately after one op is in flight. Implementations that support this
>>>>>> use mode can allow many ops from the same session
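A minimal sketch of the per-stream behaviour proposed above, seen from the PMD/hardware side: after out-of-space, later ops on that stream come back unprocessed until the caller marks an op as the resume point, while ops on other streams carry on. The `resume` field stands in for the proposed stream_resume() call; all names are hypothetical, and the sketch uses NOT_PROCESSED (the status agreed further down this thread) rather than STREAM_PAUSED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum op_status { OP_SUCCESS, OP_OUT_OF_SPACE_RECOVERABLE, OP_NOT_PROCESSED };

struct stream {
    bool paused; /* set when an op of this stream ran out of space */
};

struct op {
    struct stream *stream;
    bool resume;         /* hypothetical marker: caller acknowledged the pause */
    bool would_overflow; /* simulation input: dst too small for this op */
    enum op_status status;
};

/* Process ops in submission order. Once a stream hits out-of-space, its
 * later ops are returned NOT_PROCESSED until one carries the resume
 * marker. Ops belonging to other streams are unaffected. */
static void process_burst(struct op *ops, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct op *op = &ops[i];

        assert(op->stream != NULL);
        if (op->stream->paused && !op->resume) {
            op->status = OP_NOT_PROCESSED;
            continue;
        }
        op->stream->paused = false;
        if (op->would_overflow) {
            op->stream->paused = true;
            op->status = OP_OUT_OF_SPACE_RECOVERABLE;
        } else {
            op->status = OP_SUCCESS;
        }
    }
}
```

In this model the state check lives with the stream, which is why Fiona's concern further down is about having to inspect each op's stream on the enqueue path.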
>>>>>>
>>>>> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If yes, I would still
>>>>> say it would add an overhead to PMDs, especially if it is expected to work closer to HW (which I think is
>>>>> the case with DPDK PMD).
>>>>> Though your approach is doable, why can't all this be in a layer above the PMD? i.e. a layer above the PMD
>>>>> can either pass one op at a time with burst size = 1 OR can make chained mbufs of input and output and
>>>>> pass that as one op.
>>>>> Is it just to ease the chained mbuf burden on applications or do you see any performance/use-case
>>>>> impacting aspect also?
>>>>>
>>>>> If it is in a context where each op in a burst belongs to a different stream, then why do we need
>>>>> stream_pause and resume? The app is expected to pass a larger output buffer with consumed + 1
>>>>> from the next call onwards as it has already
>>>>> seen OUT_OF_SPACE.
>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
>>> implementation rejects all ops that belong to a stream that has entered
>>> "RECOVERABLE" state for one reason or another. The caller must
>>> acknowledge explicitly that it has received news of the problem before
>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
>>> that implementing this functionality in the software layer above the PMD
>>> is a bad idea since the latency reductions are lost.
>> [Shally] Just reiterating, I rather meant the other way around, i.e. I see it as easier to put all such complexity in a
>> layer above the PMD.
>>
>>> This setup is useful in latency sensitive applications where the latency
>>> of buffering multiple ops into one op is significant. We found latency
>>> makes a significant difference in search applications where the PMD
>>> competes with software decompression.
> [Fiona] I see, so when all goes well, you get best-case latency, but when 
> out-of-space occurs latency will probably be worse.
[Ahmed] This is exactly right. This use mode assumes out-of-space is a
rare occurrence. Recovering from it should take similar time to
synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
both sync and async use. The caller can fix up the op and send it back
to the PMD to continue work just as would be done in sync. Nonetheless,
the added complexity is not justifiable if out-of-space is very common
since the recoverable state will be the limiting factor that forces
synchronicity.
>>>> [Fiona] I still have concerns with this and would not want to support it in our PMD.
>>>> To make sure I understand, you want to send a burst of ops, with several from the same stream.
>>>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
>>>> subsequent ops in that stream.
>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
>>>> Or somehow drop them? How?
>>>> While still processing ops from other streams.
>>> [Ahmed] This is exactly correct. It should return them with
>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>>>> As we want to offload each op to hardware with as little CPU processing as possible we
>>>> would not want to open up each op to see which stream it's attached to and
>>>> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.
>>> [Ahmed] I think I might have missed your point here, but I will try to
>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
>>> to the PMD and the PMD should reject until stream_continue() is called.
>>> The next op to be sent by the user will have a special marker in it to
>>> inform the PMD to continue working on this stream. Alternatively the
>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
>>> the stream and its state, but like you say this adds additional CPU
>>> overhead during the enqueue.
>>> I am curious: in a simple synchronous use case, how do we prevent users
>>> from putting multiple ops in flight that belong to a single stream? Do
>>> we just currently say it is undefined behavior? Otherwise we would have
>>> to check the stream and incur the CPU overhead.
> [Fiona] We don't do anything to prevent it. It's undefined. IMO on data path in
> DPDK model we expect good behaviour and don't have to error check for things like this.
[Ahmed] This makes sense. We also assume good behavior.
> In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then 
> build and send those messages. If we found an op from a stream which already
> had one inflight, we'd have to hold that back, store in a sw stream-specific holding queue,
> only send 19 to hw. We cannot send multiple ops from same stream to
> the hw as it fans them out and does them in parallel.
> Once the enqueue_burst() returns, there is no processing 
> context which would spot that the first has completed
> and send the next op to the hw. On a dequeue_burst() we would spot this, 
> in that context could process the next op in the stream.
> On out of space, instead of processing the next op we would have to transfer
> all unprocessed ops from the stream to the dequeue result.
> Some parts of this are doable, but seems likely to add a lot more latency, 
> we'd need to add extra threads and timers to move ops from the sw
> queue to the hw q to get any benefit, and these constructs would add 
> context switching and CPU cycles. So we prefer to push this responsibility
> to above the API and it can achieve similar.
[Ahmed] I see what you mean. Our workflow is almost exactly the same
with our hardware, but the fanning out is done by the hardware based on
the stream and ops that belong to the same stream are never allowed to
go out of order. Otherwise the data would be corrupted. Likewise the
hardware is responsible for checking the state of the stream and
returning frames as NOT_PROCESSED to the software.
>>>> Maybe we could add a capability if this behaviour is important for you?
>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
>>>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
>>>> to be in flight at any time.
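Fiona's suggested capability could be consumed by an application roughly as below. The flag name comes from this thread, but the macro spelling and bit position are hypothetical; the point is only that without the capability the application keeps at most one op per stateful stream in flight.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bit; only the name is from the proposal. */
#define COMP_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS (UINT64_C(1) << 0)

/* Upper bound on ops of one stateful stream the app keeps in flight. */
static unsigned int max_inflight_per_stream(uint64_t feature_flags,
                                            unsigned int app_limit)
{
    assert(app_limit >= 1);
    if (feature_flags & COMP_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS)
        return app_limit; /* PMD/HW keeps same-stream ops in order itself */
    return 1;             /* dequeue each op before enqueuing the next */
}
```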
>>> [Ahmed] That makes sense. This way the different DPDK implementations do
>>> not have to add extra checking for unsupported cases.
>> [Shally] @ahmed, if I summarise your use-case, is this how you want the PMD to support it?
>> - a burst *carries only one stream* and all ops are then assumed to belong to that stream? (please note,
>> here a burst is not carrying more than one stream)
[Ahmed] No. In this use case the caller sets up an op and enqueues a
single op. Then before the response comes back from the PMD the caller
enqueues a second op on the same stream.
>> -PMD will submit one op at a time to HW?
[Ahmed] I misunderstood what PMD means. I used it throughout to mean the
HW. I used DPDK to mean the software implementation that talks to the
hardware.
The software will submit all ops immediately. The hardware has to figure
out what to do with the ops depending on what stream they belong to.
>> -if processed successfully, push it back to completion queue with status = SUCCESS. If it fails or runs
>> into OUT_OF_SPACE, then push it to completion queue with status = FAILURE/
>> OUT_OF_SPACE_RECOVERABLE and the rest with status = NOT_PROCESSED, and return with enqueue count
>> = total # of ops submitted originally with burst?
[Ahmed] This is exactly what I had in mind. All ops will be submitted to
the HW. The HW will put all of them on the completion queue with the
correct status exactly as you say.
>> -app assumes all have been enqueued, so it goes and dequeues all ops
>> -on seeing an op with OUT_OF_SPACE_RECOVERABLE, app resubmits a burst of ops with a call to
>> stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and the others marked
>> NOT_PROCESSED, with updated input and output buffers?
[Ahmed] Correct, this is what we do today in our proprietary API.
>> -repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If at any time a
>> failure is seen, does the app start the whole processing all over again or just drop this burst?!
[Ahmed] The app has the choice on how to proceed. If the issue is
recoverable then the application can continue this stream from where it
stopped. If the failure is unrecoverable then the application should
first fix the problem and start from the beginning of the stream.
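The application-side recovery described above (dequeue everything, then resubmit from the op that hit OUT_OF_SPACE_RECOVERABLE together with the NOT_PROCESSED ops after it, or restart the stream on an unrecoverable failure) can be sketched as a small classifier over the dequeued statuses of one stream. Names are hypothetical, and it assumes the ops of the stream are dequeued in submission order.

```c
#include <assert.h>
#include <stddef.h>

enum st { S_SUCCESS, S_FAILURE, S_OUT_OF_SPACE_RECOVERABLE, S_NOT_PROCESSED };

enum verdict { ALL_DONE, RESUBMIT_FROM, RESTART_STREAM };

/* Inspect the statuses of one stream's ops in submission order.
 * On RESUBMIT_FROM, *first is the index of the op that ran out of space;
 * that op (input advanced by its consumed count, fresh output buffer) and
 * the NOT_PROCESSED ops after it are resubmitted. On RESTART_STREAM the
 * app fixes the problem and starts again from the beginning of the stream. */
static enum verdict classify(const enum st *s, size_t n, size_t *first)
{
    assert(first != NULL);
    for (size_t i = 0; i < n; i++) {
        if (s[i] == S_FAILURE)
            return RESTART_STREAM;
        if (s[i] == S_OUT_OF_SPACE_RECOVERABLE) {
            *first = i;
            return RESUBMIT_FROM;
        }
    }
    return ALL_DONE;
}
```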
>> If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream()
>> which will be functional under Feature Flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS or better
>> name is SUPPORT_ENQUEUE_SINGLE_STREAM?!
[Ahmed] The main advantage in async use is lost if we force all related
ops to be in the same burst. If we do that, then we might as well merge
all the ops into one op. That would reduce the overhead.
The use mode I am proposing is only useful in cases where the data
becomes available after the first enqueue occurred. I want to allow the
caller to enqueue the second set of data as soon as it is available
regardless of whether or not the HW has already started working on the
first op inflight.
> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one stream,
> or how this makes a difference, as there can be many enqueue_burst() calls done before a dequeue_burst().
> Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all the ops
> had been processed? This would turn it into a synchronous call which isn't the intent.
[Ahmed] Agreed, a blocking or even a buffering software layer that
babysits the hardware does not fundamentally change the parameters of the
system as a whole. It just moves workflow management complexity down
into the DPDK software layer. Rather there are real latency and
throughput advantages (because of caching) that I want to expose.

/// snip ///



* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-16 21:21                         ` Ahmed Mansour
@ 2018-02-20  9:58                           ` Verma, Shally
  2018-02-20 19:56                             ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-02-20  9:58 UTC
  To: Ahmed Mansour, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry



>-----Original Message-----
>From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>Sent: 17 February 2018 02:52
>To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>Subject: Re: [RFC v2] doc compression API for DPDK
>
>>> -----Original Message-----
>>> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
>>> Sent: Friday, February 16, 2018 7:17 AM
>>> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
>>> dev@dpdk.org
>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
>>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>> Subject: RE: [RFC v2] doc compression API for DPDK
>>>
>>> Hi Fiona, Ahmed
>>>
>>>> -----Original Message-----
>>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>>>> Sent: 16 February 2018 02:40
>>>> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>>> <Ashish.Gupta@cavium.com>; Sahu, Sunila
>>>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa,
>>> Mahipal
>>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal
>>> <hemant.agrawal@nxp.com>; Roy
>>>> Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>>
>>>> On 2/15/2018 1:47 PM, Trahe, Fiona wrote:
>>>>> Hi Shally, Ahmed,
>>>>> Sorry for the delay in replying,
>>>>> Comments below
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Verma, Shally [mailto:Shally.Verma@cavium.com]
>>>>>> Sent: Wednesday, February 14, 2018 7:41 AM
>>>>>> To: Ahmed Mansour <ahmed.mansour@nxp.com>; Trahe, Fiona <fiona.trahe@intel.com>;
>>>>>> dev@dpdk.org
>>>>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>>>>>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>>>>>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
>>>>>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>>>>>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>>>>> Subject: RE: [RFC v2] doc compression API for DPDK
>>>>>>
>>>>>> Hi Ahmed,
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>>>>>>> Sent: 02 February 2018 01:53
>>>>>>> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>;
>>> dev@dpdk.org
>>>>>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>>>>>> <Ashish.Gupta@cavium.com>; Sahu, Sunila
>>>>>>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa,
>>>>>> Mahipal
>>>>>>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal
>>>>>> <hemant.agrawal@nxp.com>; Roy
>>>>>>> Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>>>>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>>>>>
>>>>>>> On 1/31/2018 2:03 PM, Trahe, Fiona wrote:
>>>>>>>> Hi Ahmed, Shally,
>>>>>>>>
>>>>>>>> ///snip///
>>>>>>>>>>>>>> D.1.1 Stateless and OUT_OF_SPACE
>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>> OUT_OF_SPACE is a condition when output buffer runs out of space and
>>>>>>>>>>>>>> where PMD still has more data to produce. If PMD runs into such a condition,
>>>>>>>>>>>>>> then it's an error condition in stateless processing.
>>>>>>>>>>>>>> In such case, PMD resets itself and returns with status
>>>>>>>>>>>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0 i.e.
>>>>>>>>>>>>>> no input read, no output written.
>>>>>>>>>>>>>> Application can resubmit a full input with larger output buffer size.
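The D.1.1 behaviour implies a simple retry loop on the application side: resubmit the full input with a larger output buffer. A minimal sketch, assuming a doubling policy with an upper bound (an illustrative choice, not part of the proposal) and a hypothetical mock op in place of a real PMD; only buffer sizes are simulated.

```c
#include <assert.h>
#include <stddef.h>

enum status { ST_OK, ST_OUT_OF_SPACE };

struct result {
    enum status st;
    size_t consumed, produced;
};

/* Hypothetical stand-in for one stateless op whose complete output needs
 * `needed` bytes. On OUT_OF_SPACE the PMD has reset itself, so it reports
 * produced = consumed = 0, as described in D.1.1. */
static struct result mock_stateless_op(size_t src_len, size_t needed,
                                       size_t dst_len)
{
    struct result r = { ST_OUT_OF_SPACE, 0, 0 };

    if (dst_len >= needed) {
        r.st = ST_OK;
        r.consumed = src_len;
        r.produced = needed;
    }
    return r;
}

/* Resubmit the *full* input each time, doubling the output buffer up to
 * max_dst. Returns the dst size that succeeded, or 0 when giving up;
 * *attempts receives the number of ops submitted. */
static size_t stateless_with_retry(size_t src_len, size_t needed,
                                   size_t dst_len, size_t max_dst,
                                   int *attempts)
{
    assert(dst_len > 0 && dst_len <= max_dst && attempts != NULL);
    *attempts = 0;
    for (;;) {
        struct result r = mock_stateless_op(src_len, needed, dst_len);

        ++*attempts;
        if (r.st == ST_OK)
            return dst_len;
        if (dst_len == max_dst)
            return 0; /* caller may fall back to a STATEFUL op instead */
        dst_len = (dst_len > max_dst / 2) ? max_dst : dst_len * 2;
    }
}
```

Each retry repeats all of the work, which is why the text reserves STATELESS for applications that already know the maximum output size.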
>>>>>>>>>>>>> [Ahmed] Can we add an option to allow the user to read the data that was
>>>>>>>>>>>>> produced while still reporting OUT_OF_SPACE? This is mainly useful for
>>>>>>>>>>>>> decompression applications doing search.
>>>>>>>>>>>> [Shally] It is there but applicable for stateful operation type (please refer to handling out_of_space under
>>>>>>>>>>>> "Stateful Section").
>>>>>>>>>>>> By definition, "stateless" here means that the application (such as IPCOMP) knows the maximum output size
>>>>>>>>>>>> with certainty and ensures that the uncompressed data size cannot grow beyond the provided output buffer.
>>>>>>>>>>>> Such apps can submit an op with type = STATELESS and provide full input; the PMD then assumes it has
>>>>>>>>>>>> sufficient input and output and thus doesn't need to maintain any context after the op is processed.
>>>>>>>>>>>> If the application doesn't know the max output size, then it should process it as a stateful op, i.e. set up the op
>>>>>>>>>>>> with type = STATEFUL and attach a stream so that the PMD can maintain relevant context to handle such a
>>>>>>>>>>>> condition.
>>>>>>>>>>> [Fiona] There may be an alternative that's useful for Ahmed, while still
>>>>>>>>>>> respecting the stateless concept.
>>>>>>>>>>> In Stateless case where a PMD reports OUT_OF_SPACE in decompression case
>>>>>>>>>>> it could also return consumed=0, produced = x, where x>0. X indicates the
>>>>>>>>>>> amount of valid data which has been written to the output buffer. It is not
>>>>>>>>>>> complete, but if an application wants to search it, it may be sufficient.
>>>>>>>>>>> If the application still wants the data it must resubmit the whole input with a
>>>>>>>>>>> bigger output buffer, and decompression will be repeated from the start; it
>>>>>>>>>>> cannot expect to continue on as the PMD has not maintained state, history
>>>>>>>>>>> or data.
>>>>>>>>>>> I don't think there would be any need to indicate this in capabilities; PMDs
>>>>>>>>>>> which cannot provide this functionality would always return produced=consumed=0,
>>>>>>>>>>> while PMDs which can could set produced > 0.
>>>>>>>>>>> If this works for you both, we could consider a similar case for compression.
>>>>>>>>>>>
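Fiona's variant (consumed=0, produced=x with x>0 on a stateless decompression OUT_OF_SPACE) lets a search application scan the partial output before deciding whether to resubmit. A minimal sketch, assuming the application trusts only the first `produced` bytes of dst; the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* After a stateless decompression op returns OUT_OF_SPACE with
 * consumed = 0 and produced = x, only dst[0..x) holds valid data.
 * Returns 1 if needle occurs entirely inside that valid prefix. */
static int partial_output_contains(const unsigned char *dst, size_t produced,
                                   const unsigned char *needle,
                                   size_t needle_len)
{
    assert(needle_len > 0);
    if (needle_len > produced)
        return 0;
    for (size_t i = 0; i + needle_len <= produced; i++)
        if (memcmp(dst + i, needle, needle_len) == 0)
            return 1;
    return 0;
}
```

If the match already answers the query, the application can skip the resubmission; otherwise it must resubmit the whole input with a bigger buffer, since no state was kept.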
>>>>>>>>>> [Shally] Sounds fine to me. Though in that case, consumed should also be updated to the actual
>>>>>>>>>> amount consumed by the PMD.
>>>>>>>>>> Setting consumed = 0 with produced > 0 doesn't correlate.
>>>>>>>>> [Ahmed] I like Fiona's suggestion, but I also do not like the implication
>>>>>>>>> of returning consumed = 0. At the same time returning consumed = y
>>>>>>>>> implies to the user that it can proceed from the middle. I prefer the
>>>>>>>>> consumed = 0 implementation, but I think a different return is needed to
>>>>>>>>> distinguish it from OUT_OF_SPACE that the user can recover from. Perhaps
>>>>>>>>> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
>>>>>>>>> future PMD implementations to provide recoverability even in STATELESS
>>>>>>>>> mode if they so wish. In this model STATELESS or STATEFUL would be a
>>>>>>>>> hint for the PMD implementation to make optimizations for each case, but
>>>>>>>>> it does not force the PMD implementation to limit functionality if it
>>>>>>>>> can provide recoverability.
>>>>>>>> [Fiona] So you're suggesting the following:
>>>>>>>> OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
>>>>>>>>     can be used and next op in stream should continue on from op.consumed+1.
>>>>>>>> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
>>>>>>>>     Error condition, no recovery possible.
>>>>>>>>     consumed=produced=0. Application must resubmit all input data with
>>>>>>>>     a bigger output buffer.
>>>>>>>> OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
>>>>>>>>      - consumed = 0, produced > 0. Application must resubmit all input data with
>>>>>>>>         a bigger output buffer. However in decompression case, data up to produced
>>>>>>>>         in dst buffer may be inspected/searched. Never happens in compression
>>>>>>>>         case as output data would be meaningless.
>>>>>>>>      - consumed > 0, produced > 0. PMD has stored relevant state and history and so
>>>>>>>>         can convert to stateful, using op.produced and continuing from consumed+1.
>>>>>>>> I don't expect our PMDs to use this last case, but maybe this works for others?
>>>>>>>> I'm not convinced it's not just adding complexity. It sounds like a version of stateful
>>>>>>>> without a stream, and maybe less efficient?
>>>>>>>> If so should it respect the FLUSH flag? Which would have been FULL or FINAL in the op.
>>>>>>>> Or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
>>>>>>>> simply have submitted a STATEFUL request if this is the behaviour it wants?
>>>>>>> [Ahmed] I was actually suggesting the removal of OUT_OF_SPACE entirely
>>>>>>> and replacing it with
>>>>>>> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
>>>>>>>        Error condition, no recovery possible.
>>>>>>>        - consumed=0, produced=amount of data produced. Application must
>>>>>>>          resubmit all input data with a bigger output buffer to process all of the op.
>>>>>>> OUT_OF_SPACE_RECOVERABLE - normally returned on stateful operation. Not an error.
>>>>>>>        Op.produced can be used and next op in stream should continue on from op.consumed+1.
>>>>>>>        - consumed > 0, produced > 0. PMD has stored relevant state and history and so
>>>>>>>          can continue using op.produced and continuing from consumed+1.
>>>>>>>
>>>>>>> We would not return OUT_OF_SPACE_RECOVERABLE in stateless mode in our
>>>>>>> implementation either.
>>>>>>>
>>>>>>> Regardless of speculative future PMDs, the more important aspect of this
>>>>>>> for today is that the return status clearly determines
>>>>>>> the meaning of "consumed". If it is RECOVERABLE, then consumed is
>>>>>>> meaningful. If it is TERMINATED, then consumed is meaningless.
>>>>>>> This way we take away the ambiguity of having OUT_OF_SPACE mean two
>>>>>>> different user workflows.
>>>>>>>
>>>>>>> A speculative future PMD may be designed to return RECOVERABLE for
>>>>>>> stateless ops that are attached to streams.
>>>>>>> A future PMD may look to see if an op has a stream attached, write
>>>>>>> out the state there and go into recoverable mode.
>>>>>>> In essence this leaves the choice up to the implementation and allows
>>>>>>> the PMD to take advantage of stateless optimizations
>>>>>>> so long as a "RECOVERABLE" scenario is rarely hit. The PMD will dump
>>>>>>> context as soon as it fully processes an op. It will only
>>>>>>> write context out in cases where the op chokes.
>>>>>>> This futuristic PMD should ignore the FLUSH, since this is STATELESS mode as
>>>>>>> indicated by the user, and optimize accordingly.
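A minimal C sketch of how an application could act on the two proposed statuses, assuming a simplified op that carries only the fields discussed here. The names `sketch_status`, `sketch_op` and `next_src_offset()` are hypothetical and not taken from any DPDK header; "consumed+1" in the 1-based wording above corresponds to byte offset `src_offset + consumed` below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes modeled on the names proposed in this thread. */
enum sketch_status {
    SKETCH_SUCCESS,
    SKETCH_OUT_OF_SPACE_TERMINATED,  /* consumed is meaningless */
    SKETCH_OUT_OF_SPACE_RECOVERABLE  /* consumed is meaningful */
};

/* Hypothetical, simplified op: only the fields discussed above. */
struct sketch_op {
    size_t src_offset;   /* where this op's input starts in the source */
    size_t src_length;   /* how many input bytes this op carried */
    size_t consumed;     /* filled in by the PMD */
    size_t produced;     /* filled in by the PMD */
    enum sketch_status status;
};

/* Source offset from which the next submission should start. */
static size_t next_src_offset(const struct sketch_op *op)
{
    assert(op != NULL);
    switch (op->status) {
    case SKETCH_OUT_OF_SPACE_RECOVERABLE:
        /* PMD kept state: continue right after the consumed bytes. */
        return op->src_offset + op->consumed;
    case SKETCH_OUT_OF_SPACE_TERMINATED:
        /* Ignore consumed: resubmit all input with a bigger output buffer. */
        return op->src_offset;
    case SKETCH_SUCCESS:
    default:
        /* Whole op processed: move past it. */
        return op->src_offset + op->src_length;
    }
}
```

Only the RECOVERABLE branch reads `consumed`, which is the disambiguation the two-status split is meant to provide.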
>>>>>> [Shally] IMO, it looks okay to have two separate return codes, TERMINATED and RECOVERABLE, with the
>>>>>> definitions you mentioned, and it seems doable.
>>>>>> So that implies all of the following conditions:
>>>>>> a. stateless with flush = full/final, no stream pointer provided: PMD can return TERMINATED, i.e. the user
>>>>>> has to start all over again; it's a failure (as in the current definition)
>>>>>> b. stateless with flush = full/final, stream pointer provided: here it's up to the PMD to return either
>>>>>> TERMINATED or RECOVERABLE depending upon its ability (note: if RECOVERABLE, then the PMD will maintain
>>>>>> state in the stream pointer)
>>>>>> c. stateful with flush = full / NO_SYNC, stream pointer always there: PMD will return
>>>>>> TERMINATED/RECOVERABLE depending on whether the STATEFUL_COMPRESSION/DECOMPRESSION feature flag
>>>>>> is enabled or not
>>>>> [Fiona] I don't think the flush flag is relevant - it could be out of space on any flush flag, and if out of space
>>>>> should ignore the flush flag.
>>>>> Is there a need for TERMINATED? - I didn't think it would ever need to be returned in the stateful case.
>>>>>  Why the ref to the feature flag? If a PMD doesn't support a feature I think it should fail the op - not with
>>>>>  out-of-space, but unsupported or similar. Or it would fail on stream creation.
>>>> [Ahmed] Agreed with Fiona. The flush flag only matters on success. By
>>>> definition the PMD should return OUT_OF_SPACE_RECOVERABLE in stateful
>>>> mode when it runs out of space.
>>>> @Shally If the user did not provide a stream, then the PMD should
>>>> probably return TERMINATED every time. I am not sure we should make a
>>>> "really smart" PMD which returns RECOVERABLE even if no stream pointer
>>>> was given. In that case the PMD must give some ID back to the caller
>>>> that the caller can use to "recover" the op. I am not sure how it would
>>>> be implemented in the PMD and when does the PMD decide to retire streams
>>>> belonging to dead ops that the caller decided not to "recover".
>>>>>> and one more exception case is:
>>>>>> d. stateless with flush = full, no stream pointer provided: PMD can return RECOVERABLE, i.e. the PMD
>>>>>> internally maintained that state somehow and consumed & produced > 0, so the user can continue from
>>>>>> consumed+1, but there's a restriction on the user not to alter or change the op until it is fully processed?!
>>>>> [Fiona] Why the need for this case?
>>>>> There's always a restriction on user not to alter or change op until it is fully processed.
>>>>> If a PMD can do this - why doesn't it create a stream when that API is called - and then it's same as b?
>>>> [Ahmed] Agreed. The user should not touch an op once enqueued until they
>>>> receive it in dequeue. We ignore the flush in stateless mode. We assume
>>>> it to be final every time.
>>> [Shally] Agreed, and I am not in favour of supporting such an implementation either. I just listed out the different
>>> possibilities here to better visualise Ahmed's requirements/applicability of TERMINATED and
>>> RECOVERABLE.
>>>
>>>>>> The API currently takes care of cases a and c, and case b can be supported if the specification accepts another
>>>>>> proposal which mentions optional usage of a stream with stateless ops.
>>>>> [Fiona] API has this, but as we agreed, not optional to call the create_stream() with an op_type
>>>>> parameter (stateful/stateless). PMD can return NULL or provide a stream, if the latter then that
>>>>> stream must be attached to ops.
>>>>>
>>>>>> Until then, the API makes no distinction between
>>>>>> cases b and c, i.e. we can have an op such as:
>>>>>> - type = stateful with flush = full/final, stream pointer provided: PMD can return
>>>>>> TERMINATED/RECOVERABLE according to its ability
>>>>>>
>>>>>> Case d is exceptional; if there's a requirement for PMDs to support it, then I believe it will be
>>>>>> doable with the concept of different return codes.
>>>>>>
>>>>> [Fiona] That's not quite how I understood it. Can it be simpler and only following cases?
>>>>> a. stateless with flush = full/final, no stream pointer provided , PMD can return TERMINATED i.e. user
>>>>>     has to start all over again, it's a failure (as in current definition).
>>>>>     consumed = 0, produced=amount of data produced. This is usually 0, but in decompression
>>>>>     case a PMD may return > 0 and application may find it useful to inspect that data.
>>>>> b. stateless with flush = full/final, stream pointer provided, here it's up to PMD to return either
>>>>>     TERMINATED or RECOVERABLE depending upon its ability (note if Recoverable, then PMD will maintain
>>>>>     states in stream pointer)
>>>>> c. stateful with flush = any, stream pointer always there, PMD will return RECOVERABLE.
>>>>>     op.produced can be used and next op in stream should continue on from op.consumed+1.
>>>>>     Consumed=0, produced=0 is an unusual but allowed case. I'm not sure if it could ever happen, but
>>>>>     no need to change state to TERMINATED in this case. There may be useful state/history
>>>>>     stored in the PMD, even though no output produced yet.
>>>> [Ahmed] Agreed
>>> [Shally] Sounds good.
>>>
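Fiona's cases a, b and c can be read as a small decision function. This is a sketch only: `pmd_saves_stateless_state` is a hypothetical stand-in for whatever ability lets a PMD keep state for a stateless op in case b, and the enum names are illustrative, not DPDK definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; not taken from any DPDK header. */
enum sketch_op_type { SKETCH_OP_STATELESS, SKETCH_OP_STATEFUL };
enum sketch_oos { SKETCH_OOS_TERMINATED, SKETCH_OOS_RECOVERABLE };

/* Which out-of-space status a PMD reports under cases a, b and c.
 * The flush flag is intentionally not a parameter: it is ignored on
 * out-of-space. */
static enum sketch_oos
out_of_space_status(enum sketch_op_type type, bool stream_attached,
                    bool pmd_saves_stateless_state)
{
    if (type == SKETCH_OP_STATEFUL)
        return SKETCH_OOS_RECOVERABLE;   /* case c: always recoverable */
    if (stream_attached && pmd_saves_stateless_state)
        return SKETCH_OOS_RECOVERABLE;   /* case b, capable PMD */
    return SKETCH_OOS_TERMINATED;        /* case a, or case b otherwise */
}
```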
>>>>>>>>>>>>>> D.2 Compression API Stateful operation
>>>>>>>>>>>>>> ----------------------------------------------------------
>>>>>>>>>>>>>>  A Stateful operation in DPDK compression means the application invokes
>>>>>>>>>>>>>> enqueue_burst() multiple times to process related chunks of data, either because
>>>>>>>>>>>>>> - Application broke data into several ops, and/or
>>>>>>>>>>>>>> - PMD ran into out_of_space situation during input processing
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In case of either one or all of the above conditions, the PMD is required to
>>>>>>>>>>>>>> maintain the state of the op across enqueue_burst() calls, and
>>>>>>>>>>>>>> ops are set up with op_type RTE_COMP_OP_STATEFUL, beginning with
>>>>>>>>>>>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and ending with flush value
>>>>>>>>>>>>>> RTE_COMP_FULL/FINAL_FLUSH.
>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>>>> Ideally, the application should parse through all related chunks of source
>>>>>>>>>>>>>> data, build an mbuf chain from them and enqueue it for stateless processing.
>>>>>>>>>>>>>> However, if it needs to break the data into several enqueue_burst() calls,
>>>>>>>>>>>>>> the expected call flow would be something like:
>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue_burst()
>>>>>>>>>>>>> in a loop until all ops are received. Is this correct?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this
>>>>>>>>>>>> illustration is specifically in the context of stateful op processing: if a
>>>>>>>>>>>> stream is broken into chunks, then each chunk should be submitted as one op
>>>>>>>>>>>> at a time with type = STATEFUL, and needs to be dequeued before the next
>>>>>>>>>>>> chunk is enqueued.
>>>>>>>>>>>>
>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
>>>>>>>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just
>>>>>>>>>>>>> distinguish the response in exception cases?
>>>>>>>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each
>>>>>>>>>>>> op in such a case is independent of the others, i.e. they belong to different
>>>>>>>>>>>> streams altogether.
>>>>>>>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all
>>>>>>>>>>>> related chunks of data in a single burst by passing them as an ops array, but
>>>>>>>>>>>> later found that not so useful for PMD handling, for various reasons. Please
>>>>>>>>>>>> refer to the RFC v1 doc review comments for the same.
>>>>>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
>>>>>>>>>>> time, since each needs the
>>>>>>>>>>> state of the previous, to allow more than 1 op to be in-flight at a time would
>>>>>>>>>>> force PMDs to implement internal queueing and exception handling for
>>>>>>>>>>> OUT_OF_SPACE conditions you mention.
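The one-op-in-flight rule for a stateful stream can be sketched in C with a mock PMD. Assumptions: `mock_process()` is a hypothetical stand-in that plays the role of one enqueue_burst() plus its matching dequeue_burst() and merely copies bytes (so out-of-space is easy to trigger); a real PMD would compress and keep history in the stream. All names here are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sk_status { SK_SUCCESS, SK_OOS_RECOVERABLE };
enum sk_flush { SK_NO_FLUSH, SK_FINAL_FLUSH };

/* Hypothetical op: source/destination windows plus results. */
struct sk_op {
    const unsigned char *src; size_t src_len;
    unsigned char *dst; size_t dst_len;
    enum sk_flush flush;
    size_t consumed, produced;
    enum sk_status status;
};

/* Mock PMD: identity "compression" that stops when dst is full. */
static void mock_process(struct sk_op *op)
{
    size_t n = op->src_len < op->dst_len ? op->src_len : op->dst_len;
    memcpy(op->dst, op->src, n);
    op->consumed = n;
    op->produced = n;
    op->status = (n < op->src_len) ? SK_OOS_RECOVERABLE : SK_SUCCESS;
}

/* Stateful processing with exactly one op of the stream in flight:
 * build op, enqueue, dequeue, then build the next op from the result.
 * out must hold at least in_len bytes. Returns total bytes produced. */
static size_t stateful_run(const unsigned char *in, size_t in_len, size_t chunk,
                           unsigned char *out, size_t dst_per_op)
{
    assert(chunk > 0 && dst_per_op > 0);
    size_t off = 0, out_off = 0;
    while (off < in_len) {
        struct sk_op op = {0};
        op.src = in + off;
        op.src_len = (in_len - off < chunk) ? in_len - off : chunk;
        op.dst = out + out_off;
        op.dst_len = dst_per_op;
        /* Last chunk of the stream carries the final flush. */
        op.flush = (off + op.src_len == in_len) ? SK_FINAL_FLUSH : SK_NO_FLUSH;
        mock_process(&op);        /* enqueue_burst() + dequeue_burst() */
        off += op.consumed;       /* on RECOVERABLE, resume after consumed */
        out_off += op.produced;
    }
    return out_off;
}
```

Because the next op is only built after the previous one is dequeued, an OUT_OF_SPACE_RECOVERABLE result is handled by simply starting the next op at the first unconsumed byte; no PMD-side queueing is needed.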
>>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
>>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>>>>>>>> complex but doable.
>>>>>>>> [Fiona] In my opinion this is not doable, could be very inefficient.
>>>>>>>> There may be many streams.
>>>>>>>> The PMD would have to have an internal queue per stream so
>>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>>>>>>>> And this may ripple back through all subsequent ops in the stream as each
>>>>>>>> source len is increased and its dst buffer is not big enough.
>>>>>>> [Ahmed] Regarding multi op OUT_OF_SPACE handling.
>>>>>>> The caller would still need to adjust the src length/output buffer as you say.
>>>>>>> The PMD cannot handle OUT_OF_SPACE internally.
>>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream until
>>>>>>> it gets explicit confirmation from the caller to continue working on this
>>>>>>> stream. Any ops received by the PMD should be returned to the caller with
>>>>>>> status STREAM_PAUSED, since the caller did not explicitly acknowledge that it
>>>>>>> has solved the OUT_OF_SPACE issue.
>>>>>>> These semantics can be enabled by adding a new function to the API, perhaps
>>>>>>> stream_resume(). This allows the caller to indicate that it has seen the
>>>>>>> issue and that this op should be used to resolve it. Implementations that
>>>>>>> do not support this mode of use can push back immediately after one op is in
>>>>>>> flight. Implementations that support this use mode can allow many ops from
>>>>>>> the same session to be in flight.
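The pause/resume semantics proposed above can be modeled as a per-stream flag. A minimal sketch, assuming a hypothetical `rs_stream_resume()` that sets a marker on the op carrying the fixed-up buffers; rejected ops are returned as NOT_PROCESSED (the status used later in the thread) rather than STREAM_PAUSED. None of these names exist in DPDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rs_status { RS_SUCCESS, RS_OOS_RECOVERABLE, RS_NOT_PROCESSED };

/* Hypothetical per-stream state held by the PMD or hardware. */
struct rs_stream { bool paused; };

struct rs_op {
    struct rs_stream *stream;
    bool resume_marker;   /* set by the hypothetical resume call */
    bool will_overflow;   /* mock: pretend the dst buffer is too small */
    enum rs_status status;
};

/* Hypothetical stream_resume(): caller acknowledges the out-of-space
 * condition on the op that carries the fixed-up buffers. */
static void rs_stream_resume(struct rs_op *op)
{
    assert(op != NULL);
    op->resume_marker = true;
}

/* Process one op, in submission order for its stream. */
static void rs_process(struct rs_op *op)
{
    struct rs_stream *s = op->stream;
    if (s->paused) {
        if (!op->resume_marker) {
            op->status = RS_NOT_PROCESSED;   /* reject until acknowledged */
            return;
        }
        s->paused = false;                   /* acknowledged: continue */
        op->resume_marker = false;
    }
    if (op->will_overflow) {
        s->paused = true;                    /* stream enters RECOVERABLE */
        op->status = RS_OOS_RECOVERABLE;
        return;
    }
    op->status = RS_SUCCESS;
}
```

Usage: with three ops from one stream in flight and the first overflowing, the first comes back RECOVERABLE and the other two NOT_PROCESSED; after the caller fixes the buffers and marks the first op, all three can be resubmitted in order.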
>>>>>>>
>>>>>> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If yes, I would still
>>>>>> say it would add overhead to PMDs, especially if they are expected to work close to HW (which I think is
>>>>>> the case with DPDK PMDs).
>>>>>> Your approach is doable, but why can't all this be in a layer above the PMD? i.e. a layer above the PMD
>>>>>> can either pass one op at a time with burst size = 1, OR can make chained mbufs of input and output and
>>>>>> pass that as one op.
>>>>>> Is it just to ease the chained-mbuf burden on applications, or do you see any performance / use-case
>>>>>> impacting aspect also?
>>>>>>
>>>>>> If it is in the context where each op in a burst belongs to a different stream, then why do we need
>>>>>> stream_pause and resume? The app is expected to pass a larger output buffer and continue from consumed + 1
>>>>>> from the next call onwards, as it has already seen OUT_OF_SPACE.
>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
>>>> implementation rejects all ops that belong to a stream that has entered
>>>> "RECOVERABLE" state for one reason or another. The caller must
>>>> acknowledge explicitly that it has received news of the problem before
>>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
>>>> that implementing this functionality in the software layer above the PMD
>>>> is a bad idea since the latency reductions are lost.
>>> [Shally] Just reiterating, I meant it the other way around, i.e. I see it as easier to put all such complexity in a
>>> layer above the PMD.
>>>
>>>> This setup is useful in latency sensitive applications where the latency
>>>> of buffering multiple ops into one op is significant. We found latency
>>>> makes a significant difference in search applications where the PMD
>>>> competes with software decompression.
>> [Fiona] I see, so when all goes well, you get best-case latency, but when
>> out-of-space occurs latency will probably be worse.
>[Ahmed] This is exactly right. This use mode assumes out-of-space is a
>rare occurrence. Recovering from it should take similar time to
>synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
>both sync and async use. The caller can fix up the op and send it back
>to the PMD to continue work just as would be done in sync. Nonetheless,
>the added complexity is not justifiable if out-of-space is very common
>since the recoverable state will be the limiting factor that forces
>synchronicity.
>>>>> [Fiona] I still have concerns with this and would not want to support in our PMD.
>>>>> To make sure I understand, you want to send a burst of ops, with several from the same stream.
>>>>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
>>>>> subsequent ops in that stream.
>>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
>>>>> Or somehow drop them? How?
>>>>> While still processing ops from other streams.
>>>> [Ahmed] This is exactly correct. It should return them with
>>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>>>>> As we want to offload each op to hardware with as little CPU processing as possible we
>>>>> would not want to open up each op to see which stream it's attached to and
>>>>> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.
>>>> [Ahmed] I think I might have missed your point here, but I will try to
>>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
>>>> to the PMD and the PMD should reject until stream_continue() is called.
>>>> The next op to be sent by the user will have a special marker in it to
>>>> inform the PMD to continue working on this stream. Alternatively the
>>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
>>>> the stream and its state, but like you say this adds additional CPU
>>>> overhead during the enqueue.
>>>> I am curious. In a simple synchronous use case, how do we prevent users
>>>> from putting multiple ops in flight that belong to a single stream? Do
>>>> we just currently say it is undefined behavior? Otherwise we would have
>>>> to check the stream and incur the CPU overhead.
>> [Fiona] We don't do anything to prevent it. It's undefined. IMO on data path in
>> DPDK model we expect good behaviour and don't have to error check for things like this.
>[Ahmed] This makes sense. We also assume good behavior.
>> In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then
>> build and send those messages. If we found an op from a stream which already
>> had one inflight, we'd have to hold that back, store in a sw stream-specific holding queue,
>> only send 19 to hw. We cannot send multiple ops from same stream to
>> the hw as it fans them out and does them in parallel.
>> Once the enqueue_burst() returns, there is no processing
>> context which would spot that the first has completed
>> and send the next op to the hw. On a dequeue_burst() we would spot this,
>> in that context could process the next op in the stream.
>> On out of space, instead of processing the next op we would have to transfer
>> all unprocessed ops from the stream to the dequeue result.
>> Some parts of this are doable, but seems likely to add a lot more latency,
>> we'd need to add extra threads and timers to move ops from the sw
>> queue to the hw q to get any benefit, and these constructs would add
>> context switching and CPU cycles. So we prefer to push this responsibility
>> to above the API and it can achieve similar.
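To make the cost Fiona describes concrete, here is a sketch of a per-stream software holding queue. Assumptions: integer op ids stand in for real ops, `hq_on_complete()` is a hypothetical hook run from the dequeue_burst() context, and the fixed-size array replaces whatever queue a real PMD would use. It shows the per-op stream inspection on enqueue that the PMD would prefer to avoid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HQ_MAX 8

/* Hypothetical per-stream software holding queue. */
struct hq_stream {
    bool inflight;        /* one op from this stream is already on the hw q */
    int held[HQ_MAX];     /* op ids waiting behind the in-flight op (FIFO) */
    size_t n_held;
};

struct hq_op { int id; struct hq_stream *stream; };

/* Enqueue side: ops whose stream already has one in flight are held back.
 * Writes ids sent to "hw" into hw_q and returns how many were sent. */
static size_t hq_enqueue_burst(const struct hq_op *ops, size_t n, int *hw_q)
{
    size_t sent = 0;
    for (size_t i = 0; i < n; i++) {
        struct hq_stream *s = ops[i].stream;
        if (s->inflight) {
            assert(s->n_held < HQ_MAX);
            s->held[s->n_held++] = ops[i].id;
        } else {
            s->inflight = true;
            hw_q[sent++] = ops[i].id;
        }
    }
    return sent;
}

/* Dequeue side: when a stream's op completes, either release the next held
 * op to hw (success) or move every held op to the not-processed list
 * (out of space). Returns the id to send to hw next, or -1 if none. */
static int hq_on_complete(struct hq_stream *s, bool out_of_space,
                          int *not_processed, size_t *n_np)
{
    s->inflight = false;
    if (out_of_space) {
        for (size_t i = 0; i < s->n_held; i++)
            not_processed[(*n_np)++] = s->held[i];
        s->n_held = 0;
        return -1;
    }
    if (s->n_held == 0)
        return -1;
    int next = s->held[0];
    for (size_t i = 1; i < s->n_held; i++)   /* shift the FIFO */
        s->held[i - 1] = s->held[i];
    s->n_held--;
    s->inflight = true;
    return next;
}
```

On the out-of-space path the completing op itself would also be returned as RECOVERABLE; the sketch only tracks the held ops. Note that nothing releases a held op until some dequeue_burst() call runs `hq_on_complete()`, which is the missing processing context mentioned above.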
>[Ahmed] I see what you mean. Our workflow is almost exactly the same
>with our hardware, but the fanning out is done by the hardware based on
>the stream and ops that belong to the same stream are never allowed to
>go out of order. Otherwise the data would be corrupted. Likewise the
>hardware is responsible for checking the state of the stream and
>returning frames as NOT_PROCESSED to the software
>>>>> Maybe we could add a capability if this behaviour is important for you?
>>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
>>>>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
>>>>> to be in flight at any time.
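If such a capability were added, an application could derive its per-stream in-flight limit from it. A sketch only: the flag name follows the suggestion above, while its bit position and the helper `max_inflight_per_stream()` are hypothetical and not released DPDK definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bit named after the suggestion above. */
#define SKETCH_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS (UINT64_C(1) << 0)

/* How many ops of one stateful stream the app may keep in flight. */
static unsigned max_inflight_per_stream(uint64_t feature_flags,
                                        unsigned app_limit)
{
    assert(app_limit > 0);
    if (feature_flags & SKETCH_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS)
        return app_limit;   /* PMD/hw keeps per-stream ordering itself */
    return 1;               /* default: enqueue -> dequeue -> enqueue */
}
```

A PMD that leaves the bit clear keeps the baseline behaviour, so no extra per-op checking is needed in that PMD.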
>>>> [Ahmed] That makes sense. This way the different DPDK implementations do
>>>> not have to add extra checking for unsupported cases.
>>> [Shally] @ahmed, if I summarise your use case, is this how you want the PMD to support it?
>>> - a burst *carries only one stream* and all ops are then assumed to belong to that stream? (please note,
>>> here a burst is not carrying more than one stream)
>[Ahmed] No. In this use case the caller sets up an op and enqueues a
>single op. Then before the response comes back from the PMD the caller
>enqueues a second op on the same stream.
>>> -PMD will submit one op at a time to HW?
>[Ahmed] I misunderstood what PMD means. I used it throughout to mean the
>HW. I used DPDK to mean the software implementation that talks to the
>hardware.
>The software will submit all ops immediately. The hardware has to figure
>out what to do with the ops depending on what stream they belong to.
>>> - if processed successfully, push it back to the completion queue with status = SUCCESS. If it failed or ran
>>> into OUT_OF_SPACE, then push it to the completion queue with status = FAILURE/
>>> OUT_OF_SPACE_RECOVERABLE and the rest with status = NOT_PROCESSED, and return with enqueue count
>>> = total # of ops originally submitted with the burst?
>[Ahmed] This is exactly what I had in mind. all ops will be submitted to
>the HW. The HW will put all of them on the completion queue with the
>correct status exactly as you say.
>>> - app assumes all have been enqueued, so it goes and dequeues all ops
>>> - on seeing an op with OUT_OF_SPACE_RECOVERABLE, the app resubmits a burst of ops with a call to the
>>> stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and the others that were
>>> NOT_PROCESSED, with updated input and output buffers?
>[Ahmed] Correct, this is what we do today in our proprietary API.
>>> - repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If at any time
>>> a failure is seen, does the app start the whole processing all over again or just drop this burst?!
>[Ahmed] The app has the choice on how to proceed. If the issue is
>recoverable then the application can continue this stream from where it
>stopped. If the failure is unrecoverable then the application should
>first fix the problem and start from the beginning of the stream.
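The application side of this flow can be sketched as a scan over one stream's dequeued ops. Assumptions: the status and action names are hypothetical, and the resubmission itself (the stream_continue/resume call and the buffer updates) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical statuses and actions, named after the discussion above. */
enum ar_status { AR_SUCCESS, AR_OOS_RECOVERABLE, AR_NOT_PROCESSED, AR_FAILURE };
enum ar_action { AR_DONE, AR_RESUBMIT_FROM, AR_RESTART_STREAM };

/* st[] holds the dequeued statuses of one stream's ops in submission
 * order. On AR_RESUBMIT_FROM, *from is the index of the first op to
 * resubmit (after a resume call, with updated buffers). */
static enum ar_action plan_recovery(const enum ar_status *st, size_t n,
                                    size_t *from)
{
    assert(st != NULL && from != NULL);
    for (size_t i = 0; i < n; i++) {
        if (st[i] == AR_FAILURE)
            return AR_RESTART_STREAM;   /* unrecoverable: fix, start over */
        if (st[i] == AR_OOS_RECOVERABLE || st[i] == AR_NOT_PROCESSED) {
            *from = i;                  /* this op and everything after it */
            return AR_RESUBMIT_FROM;
        }
    }
    return AR_DONE;                     /* all ops succeeded */
}
```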
>>> If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream()
>>> which will be functional under Feature Flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS or better
>>> name is SUPPORT_ENQUEUE_SINGLE_STREAM?!
>[Ahmed] The main advantage in async use is lost if we force all related
>ops to be in the same burst. If we do that, then we might as well merge
>all the ops into one op. That would reduce the overhead.
>The use mode I am proposing is only useful in cases where the data
>becomes available after the first enqueue occurred. I want to allow the
>caller to enqueue the second set of data as soon as it is available
>regardless of whether or not the HW has already started working on the
>first op inflight.

[Shally] @ahmed, OK, it seems I missed a point here. So, please confirm the following:
  
As per current description in doc, expected stateful usage is:
enqueue (op1) --> dequeue(op1) --> enqueue(op2)

but you're suggesting to allow an option to change it to 

enqueue(op1) -->enqueue(op2) 

i.e. multiple ops from the same stream can be put in flight via subsequent enqueue_burst() calls, without waiting to dequeue previous ones, if the PMD supports it. So, no change to the current definition of a burst: it will still carry multiple streams, where each op belongs to a different stream?!
If yes, then it seems your HW can be set up for multiple streams, so it is efficient for your case to support it in the DPDK PMD layer, but our hw doesn't do this by default and needs SW to back it. Given that, I also suggest enabling it under some feature flag.

However, it looks like an add-on, and if it doesn't change the current definition of a burst and the minimum expectation set on stateful processing described in this document, then IMO you can propose this feature as an incremental patch on the baseline version, in the absence of which the
application will exercise stateful processing as described here (enq->deq->enq). Thoughts?


>> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one stream
>> Or get how this makes a difference? As there can be many enqueue_burst() calls done before a dequeue_burst()
>> Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all the ops
>> had been processed? This would turn it into a synchronous call which isn't the intent.
>[Ahmed] Agreed, a blocking or even a buffering software layer that babysits
>the hardware does not fundamentally change the parameters of the
>system as a whole. It just moves workflow management complexity down
>into the DPDK software layer. Rather there are real latency and
>throughput advantages (because of caching) that I want to expose.
>

>/// snip ///


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-20  9:58                           ` Verma, Shally
@ 2018-02-20 19:56                             ` Ahmed Mansour
  2018-02-21 14:35                               ` Trahe, Fiona
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed Mansour @ 2018-02-20 19:56 UTC (permalink / raw)
  To: Verma, Shally, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

/// snip ///
>>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>>>>> Ideally, the application should parse through all related chunks of source
>>>>>>>>>>>>>>> data, build an mbuf chain from them and enqueue it for stateless processing.
>>>>>>>>>>>>>>> However, if it needs to break the data into several enqueue_burst() calls,
>>>>>>>>>>>>>>> the expected call flow would be something like:
>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue_burst()
>>>>>>>>>>>>>> in a loop until all ops are received. Is this correct?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this
>>>>>>>>>>>>> illustration is specifically in the context of stateful op processing: if a
>>>>>>>>>>>>> stream is broken into chunks, then each chunk should be submitted as one op
>>>>>>>>>>>>> at a time with type = STATEFUL, and needs to be dequeued before the next
>>>>>>>>>>>>> chunk is enqueued.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
>>>>>>>>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just
>>>>>>>>>>>>>> distinguish the response in exception cases?
>>>>>>>>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each
>>>>>>>>>>>>> op in such a case is independent of the others, i.e. they belong to different
>>>>>>>>>>>>> streams altogether.
>>>>>>>>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all
>>>>>>>>>>>>> related chunks of data in a single burst by passing them as an ops array, but
>>>>>>>>>>>>> later found that not so useful for PMD handling, for various reasons. Please
>>>>>>>>>>>>> refer to the RFC v1 doc review comments for the same.
>>>>>>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
>>>>>>>>>>>> time, since each needs the
>>>>>>>>>>>> state of the previous, to allow more than 1 op to be in-flight at a time would
>>>>>>>>>>>> force PMDs to implement internal queueing and exception handling for
>>>>>>>>>>>> OUT_OF_SPACE conditions you mention.
>>>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
>>>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>>>>>>>>> complex but doable.
>>>>>>>>> [Fiona] In my opinion this is not doable, could be very inefficient.
>>>>>>>>> There may be many streams.
>>>>>>>>> The PMD would have to have an internal queue per stream so
>>>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>>>>>>>>> And this may ripple back through all subsequent ops in the stream as each
>>>>>>>>> source len is increased and its dst buffer is not big enough.
>>>>>>>> [Ahmed] Regarding multi op OUT_OF_SPACE handling.
>>>>>>>> The caller would still need to adjust the src length/output buffer as you say.
>>>>>>>> The PMD cannot handle OUT_OF_SPACE internally.
>>>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream until
>>>>>>>> it gets explicit confirmation from the caller to continue working on this
>>>>>>>> stream. Any ops received by the PMD should be returned to the caller with
>>>>>>>> status STREAM_PAUSED, since the caller did not explicitly acknowledge that it
>>>>>>>> has solved the OUT_OF_SPACE issue.
>>>>>>>> These semantics can be enabled by adding a new function to the API, perhaps
>>>>>>>> stream_resume(). This allows the caller to indicate that it has seen the
>>>>>>>> issue and that this op should be used to resolve it. Implementations that
>>>>>>>> do not support this mode of use can push back immediately after one op is in
>>>>>>>> flight. Implementations that support this use mode can allow many ops from
>>>>>>>> the same session to be in flight.
>>>>>>>>
>>>>>>> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If yes, I would still
>>>>>>> say it would add overhead to PMDs, especially if they are expected to work close to HW (which I think is
>>>>>>> the case with DPDK PMDs).
>>>>>>> Your approach is doable, but why can't all this be in a layer above the PMD? i.e. a layer above the PMD
>>>>>>> can either pass one op at a time with burst size = 1, OR can make chained mbufs of input and output and
>>>>>>> pass that as one op.
>>>>>>> Is it just to ease the chained-mbuf burden on applications, or do you see any performance / use-case
>>>>>>> impacting aspect also?
>>>>>>>
>>>>>>> If it is in the context where each op in a burst belongs to a different stream, then why do we need
>>>>>>> stream_pause and resume? The app is expected to pass a larger output buffer and continue from consumed + 1
>>>>>>> from the next call onwards, as it has already seen OUT_OF_SPACE.
>>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
>>>>> implementation rejects all ops that belong to a stream that has entered
>>>>> "RECOVERABLE" state for one reason or another. The caller must
>>>>> acknowledge explicitly that it has received news of the problem before
>>>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
>>>>> that implementing this functionality in the software layer above the PMD
>>>>> is a bad idea since the latency reductions are lost.
>>>> [Shally] Just reiterating, I meant it the other way around, i.e. I see it as easier to put all such complexity in a
>>>> layer above the PMD.
>>>>
>>>>> This setup is useful in latency sensitive applications where the latency
>>>>> of buffering multiple ops into one op is significant. We found latency
>>>>> makes a significant difference in search applications where the PMD
>>>>> competes with software decompression.
>>> [Fiona] I see, so when all goes well, you get best-case latency, but when
>>> out-of-space occurs latency will probably be worse.
>> [Ahmed] This is exactly right. This use mode assumes out-of-space is a
>> rare occurrence. Recovering from it should take similar time to
>> synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
>> both sync and async use. The caller can fix up the op and send it back
>> to the PMD to continue work just as would be done in sync. Nonetheless,
>> the added complexity is not justifiable if out-of-space is very common
>> since the recoverable state will be the limiting factor that forces
>> synchronicity.
>>>>>> [Fiona] I still have concerns with this and would not want to support in our PMD.
>>>>>> To make sure I understand, you want to send a burst of ops, with several from the same stream.
>>>>>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
>>>>>> subsequent ops in that stream.
>>>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
>>>>>> Or somehow drop them? How?
>>>>>> While still processing ops from other streams.
>>>>> [Ahmed] This is exactly correct. It should return them with
>>>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>>>>>> As we want to offload each op to hardware with as little CPU processing as possible we
>>>>>> would not want to open up each op to see which stream it's attached to and
>>>>>> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.
>>>>> [Ahmed] I think I might have missed your point here, but I will try to
>>>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
>>>>> to the PMD and the PMD should reject until stream_continue() is called.
>>>>> The next op to be sent by the user will have a special marker in it to
>>>>> inform the PMD to continue working on this stream. Alternatively the
>>>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
>>>>> the stream and its state, but like you say this adds additional CPU
>>>>> overhead during the enqueue.
>>>>> I am curious: in a simple synchronous use case, how do we prevent users
>>>>> from putting multiple ops in flight that belong to a single stream? Do
>>>>> we just currently say it is undefined behavior? Otherwise we would have
>>>>> to check the stream and incur the CPU overhead.
>>> [Fiona] We don't do anything to prevent it. It's undefined. IMO, on the data path in the
>>> DPDK model we expect good behaviour and don't have to error-check for things like this.
>> [Ahmed] This makes sense. We also assume good behavior.
>>> In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then
>>> build and send those messages. If we found an op from a stream which already
>>> had one inflight, we'd have to hold that back, store in a sw stream-specific holding queue,
>>> only send 19 to hw. We cannot send multiple ops from same stream to
>>> the hw as it fans them out and does them in parallel.
>>> Once the enqueue_burst() returns, there is no processing
>>> context which would spot that the first has completed
>>> and send the next op to the hw. On a dequeue_burst() we would spot this,
>>> in that context could process the next op in the stream.
>>> On out of space, instead of processing the next op we would have to transfer
>>> all unprocessed ops from the stream to the dequeue result.
>>> Some parts of this are doable, but seems likely to add a lot more latency,
>>> we'd need to add extra threads and timers to move ops from the sw
>>> queue to the hw q to get any benefit, and these constructs would add
>>> context switching and CPU cycles. So we prefer to push this responsibility
>>> above the API, where similar results can be achieved.
>> [Ahmed] I see what you mean. Our workflow is almost exactly the same
>> with our hardware, but the fanning out is done by the hardware based on
>> the stream and ops that belong to the same stream are never allowed to
>> go out of order. Otherwise the data would be corrupted. Likewise the
>> hardware is responsible for checking the state of the stream and
>> returning frames as NOT_PROCESSED to the software.
>>>>>> Maybe we could add a capability if this behaviour is important for you?
>>>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
>>>>>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
>>>>>> to be in flight at any time.
>>>>> [Ahmed] That makes sense. This way the different DPDK implementations do
>>>>> not have to add extra checking for unsupported cases.
>>>> [Shally] @ahmed, if I summarise your use-case, is this how you want the PMD to support it?
>>>> - a burst *carries only one stream* and all ops are then assumed to belong to that stream? (Please note,
>>>> here a burst does not carry more than one stream.)
>> [Ahmed] No. In this use case the caller sets up an op and enqueues a
>> single op. Then before the response comes back from the PMD the caller
>> enqueues a second op on the same stream.
>>>> -PMD will submit one op at a time to HW?
>> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the
>> HW. I used DPDK to mean the software implementation that talks to the
>> hardware.
>> The software will submit all ops immediately. The hardware has to figure
>> out what to do with the ops depending on what stream they belong to.
>>>> -if processed successfully, push it back to the completion queue with status = SUCCESS. If it fails or runs
>>>> into OUT_OF_SPACE, then push it to the completion queue with status = FAILURE/
>>>> OUT_OF_SPACE_RECOVERABLE and the rest with status = NOT_PROCESSED, and return with enqueue count
>>>> = total # of ops originally submitted in the burst?
>> [Ahmed] This is exactly what I had in mind. All ops will be submitted to
>> the HW. The HW will put all of them on the completion queue with the
>> correct status exactly as you say.
>>>> -app assumes all have been enqueued, so it goes and dequeues all ops
>>>> -on seeing an op with OUT_OF_SPACE_RECOVERABLE, app resubmits a burst of ops with a call to
>>>> stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and the others
>>>> (NOT_PROCESSED), with updated input and output buffers?
>> [Ahmed] Correct, this is what we do today in our proprietary API.
>>>> -repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If at any time
>>>> failure is seen, does the app start the whole processing all over again or just drop this burst?!
>> [Ahmed] The app has the choice on how to proceed. If the issue is
>> recoverable then the application can continue this stream from where it
>> stopped. If the failure is unrecoverable then the application should
>> first fix the problem and start from the beginning of the stream.
>>>> If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream()
>>>> which will be functional under Feature Flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS or, a better
>>>> name, SUPPORT_ENQUEUE_SINGLE_STREAM?!
>> [Ahmed] The main advantage in async use is lost if we force all related
>> ops to be in the same burst. If we do that, then we might as well merge
>> all the ops into one op. That would reduce the overhead.
>> The use mode I am proposing is only useful in cases where the data
>> becomes available after the first enqueue occurred. I want to allow the
>> caller to enqueue the second set of data as soon as it is available
>> regardless of whether or not the HW has already started working on the
>> first op inflight.
> [Shally] @ahmed, OK, it seems I missed a point here. So, please confirm the following:
>   
> As per current description in doc, expected stateful usage is:
> enqueue (op1) --> dequeue(op1) --> enqueue(op2)
>
> but you're suggesting to allow an option to change it to 
>
> enqueue(op1) -->enqueue(op2) 
>
> i.e. multiple ops from the same stream can be put in flight via subsequent enqueue_burst() calls without waiting to dequeue previous ones, as long as the PMD supports it. So, no change to the current definition of a burst: it will still carry multiple streams, where each op belongs to a different stream?!
[Ahmed] Correct. I guess a user could put two ops on the same burst that
belong to the same stream. In that case it would be more efficient to
merge the ops using scatter-gather. Nonetheless, I would not add checks
in my implementation to limit that use. The hardware does not perceive a
difference between ops that came on one burst and ops that came on two
different bursts. To the hardware they are all ops. What matters is
which stream each op belongs to.
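The point above, that only the stream an op belongs to matters and not the burst it arrived in, can be modelled in a few lines of C. This is a hedged sketch of the semantics discussed in this thread, not a PMD: `struct sim_op`, `sim_process()` and the `ST_*` statuses are hypothetical stand-ins rather than the rte_comp API, and the "hardware" only checks whether each op's output fits its destination. Once a stream hits OUT_OF_SPACE_RECOVERABLE, its later ops come back NOT_PROCESSED while other streams carry on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes named after the statuses used in this thread;
 * they are not the real rte_comp enum values. */
enum op_status {
    ST_SUCCESS,
    ST_OUT_OF_SPACE_RECOVERABLE,
    ST_NOT_PROCESSED
};

/* Hypothetical, heavily simplified op: the stream it belongs to, the
 * output it would produce and the destination space it was given. */
struct sim_op {
    int stream_id;
    size_t out_needed;
    size_t dst_len;
    enum op_status status;
};

#define SIM_MAX_STREAMS 16

/* Processes the ops currently in flight, in submission order. Whether they
 * arrived in one burst or several makes no difference; only stream_id does.
 * Once an op of a stream runs out of space, every later op of that stream
 * is returned as NOT_PROCESSED, while ops of other streams still complete. */
static void sim_process(struct sim_op *ops, size_t n)
{
    int stalled[SIM_MAX_STREAMS] = {0};

    for (size_t i = 0; i < n; i++) {
        int s = ops[i].stream_id;

        assert(s >= 0 && s < SIM_MAX_STREAMS);
        if (stalled[s]) {
            ops[i].status = ST_NOT_PROCESSED;
        } else if (ops[i].out_needed > ops[i].dst_len) {
            ops[i].status = ST_OUT_OF_SPACE_RECOVERABLE;
            stalled[s] = 1;
        } else {
            ops[i].status = ST_SUCCESS;
        }
    }
}
```

For example, four in-flight ops {stream 1 fits, stream 1 too small, stream 2 fits, stream 1 fits} come back as SUCCESS, OUT_OF_SPACE_RECOVERABLE, SUCCESS and NOT_PROCESSED. In this sketch the stalled state only lasts for one call; real hardware would keep it until the application resumes the stream.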
> if yes, then it seems your HW can be set up for multiple streams, so it is efficient for your case to support it in the DPDK PMD layer, but our HW doesn't by default and needs SW to back it. Given that, I also suggest enabling it under some feature flag.
>
> However it looks like an add-on, and if it doesn't change the current definition of a burst and the minimum expectation set on stateful processing described in this document, then IMO you can propose this feature as an incremental patch on the baseline version, in the absence of which
> the application will exercise stateful processing as described here (enq->deq->enq). Thoughts?
[Ahmed] Makes sense. I was worried that there might be fundamental
limitations to this mode of use in the API design. That is why I wanted
to share this use mode with you guys and see if it can be accommodated
using an incremental patch in the future.
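For comparison, the baseline flow that applications would use without such a capability (enq->deq->enq, with only one op of a stream in flight at a time) might look roughly like the sketch below. It is a hedged illustration only: `struct sim_op`, `sim_enqueue_dequeue()` and `run_stream()` are hypothetical, and a plain memcpy stands in for the compression engine so the example is self-contained. On OUT_OF_SPACE_RECOVERABLE the application advances its input by `consumed`, hands over fresh output space and resubmits, as described earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical statuses; not the real rte_comp enum values. */
enum op_status { ST_SUCCESS, ST_OUT_OF_SPACE_RECOVERABLE };

/* Hypothetical stateful op with the src/dst and consumed/produced fields
 * discussed in this RFC; not the real struct rte_comp_op. */
struct sim_op {
    const unsigned char *src;
    size_t src_len;
    unsigned char *dst;
    size_t dst_len;
    size_t consumed;
    size_t produced;
    enum op_status status;
};

/* Stand-in for enqueue_burst() of one op followed by dequeue_burst() of
 * its result. A copy replaces real compression: it consumes as much input
 * as fits in dst and reports OUT_OF_SPACE_RECOVERABLE if input remains. */
static void sim_enqueue_dequeue(struct sim_op *op)
{
    size_t n = op->src_len < op->dst_len ? op->src_len : op->dst_len;

    memcpy(op->dst, op->src, n);
    op->consumed = n;
    op->produced = n;
    op->status = (n < op->src_len) ? ST_OUT_OF_SPACE_RECOVERABLE : ST_SUCCESS;
}

/* Baseline stateful flow: enqueue(op) -> dequeue(op) -> enqueue(next op).
 * Each op gets at most 'chunk' bytes of output space. Returns the number of
 * bytes produced, or -1 if 'out' is too small. '*resubmits' counts how many
 * times OUT_OF_SPACE_RECOVERABLE forced a resubmission. */
static long run_stream(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap,
                       size_t chunk, int *resubmits)
{
    size_t in_off = 0, out_off = 0;

    assert(chunk > 0);
    *resubmits = 0;
    for (;;) {
        struct sim_op op;
        size_t space = out_cap - out_off;

        if (space == 0 && in_off < in_len)
            return -1;  /* no output space left at all: give up */

        op.src = in + in_off;
        op.src_len = in_len - in_off;
        op.dst = out + out_off;
        op.dst_len = space < chunk ? space : chunk;
        sim_enqueue_dequeue(&op);  /* only one op of this stream in flight */

        in_off += op.consumed;
        out_off += op.produced;
        if (op.status == ST_SUCCESS)
            return (long)out_off;
        (*resubmits)++;  /* recoverable: continue from where it stopped */
    }
}
```

With a 21-byte input and 4 bytes of output space per op, the stream completes after five OUT_OF_SPACE_RECOVERABLE resubmissions. The multi-op mode Ahmed describes would remove the dequeue wait between those submissions; this baseline loop is what every application would still be able to fall back to.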
>>> [Fiona] I'm curious about Ahmed's response to this. I didn't get that a burst should carry only one stream,
>>> or how this makes a difference, as there can be many enqueue_burst() calls done before a dequeue_burst().
>>> Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all the ops
>>> had been processed? This would turn it into a synchronous call, which isn't the intent.
>> [Ahmed] Agreed, a blocking or even a buffering software layer that babysits
>> the hardware does not fundamentally change the parameters of the
>> system as a whole. It just moves workflow management complexity down
>> into the DPDK software layer. Rather there are real latency and
>> throughput advantages (because of caching) that I want to expose.
>>
>> /// snip ///
>
>



* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-20 19:56                             ` Ahmed Mansour
@ 2018-02-21 14:35                               ` Trahe, Fiona
  2018-02-21 19:35                                 ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Trahe, Fiona @ 2018-02-21 14:35 UTC (permalink / raw)
  To: Ahmed Mansour, Verma, Shally, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry, Trahe, Fiona

Hi Ahmed, Shally,


> -----Original Message-----
> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
> Sent: Tuesday, February 20, 2018 7:56 PM
> To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona <fiona.trahe@intel.com>; dev@dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
> Subject: Re: [RFC v2] doc compression API for DPDK
> 
> /// snip ///
> >>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
> >>>>>>>>>>>>>>> ---------------------------------------------------------------
> >>>>>>>>>>>>>>> It is always an ideal expectation from the application that it should parse
> >>>>>>>>>>>>>>> through all related chunks of source data, making its mbuf chain, and enqueue
> >>>>>>>>>>>>>>> it for stateless processing.
> >>>>>>>>>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then an
> >>>>>>>>>>>>>>> expected call flow would be something like:
> >>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
> >>>>>>>>>>>>>> burst in a loop until all ops are received. Is this correct?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is
> >>>>>>>>>>>>> specifically in the context of stateful op processing, to reflect that if a stream is broken into
> >>>>>>>>>>>>> chunks, then each chunk should be submitted as one op at a time with type = STATEFUL and
> >>>>>>>>>>>>> needs to be dequeued first before the next chunk is enqueued.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
> >>>>>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
> >>>>>>>>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish
> >>>>>>>>>>>>>> the response in exception cases?
> >>>>>>>>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in such a
> >>>>>>>>>>>>> case is independent of the others, i.e. they belong to different streams altogether.
> >>>>>>>>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks
> >>>>>>>>>>>>> of data in a single burst by passing them as an ops array, but later found that not-so-useful
> >>>>>>>>>>>>> for PMD handling for various reasons. Please refer to the RFC v1 doc review comments for details.
> >>>>>>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a time, since each
> >>>>>>>>>>>> needs the state of the previous, allowing more than 1 op to be in flight at a time would
> >>>>>>>>>>>> force PMDs to implement internal queueing and exception handling for the OUT_OF_SPACE
> >>>>>>>>>>>> conditions you mention.
> >>>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
> >>>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
> >>>>>>>>>> complex but doable.
> >>>>>>>>> [Fiona] In my opinion this is not doable; it could be very inefficient.
> >>>>>>>>> There may be many streams.
> >>>>>>>>> The PMD would have to have an internal queue per stream so
> >>>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
> >>>>>>>>> And this may ripple back through all subsequent ops in the stream as each
> >>>>>>>>> source len is increased and its dst buffer is not big enough.
> >>>>>>>> [Ahmed] Regarding multi-op OUT_OF_SPACE handling:
> >>>>>>>> The caller would still need to adjust the src length/output buffer as you say.
> >>>>>>>> The PMD cannot handle OUT_OF_SPACE internally.
> >>>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream until it gets
> >>>>>>>> explicit confirmation from the caller to continue working on this stream. Any ops
> >>>>>>>> received by the PMD should be returned to the caller with status STREAM_PAUSED, since
> >>>>>>>> the caller did not explicitly acknowledge that it has solved the OUT_OF_SPACE issue.
> >>>>>>>> These semantics can be enabled by adding a new function to the API, perhaps
> >>>>>>>> stream_resume(). This allows the caller to indicate that it acknowledges that it has
> >>>>>>>> seen the issue and that this op should be used to resolve the issue. Implementations
> >>>>>>>> that do not support this mode of use can push back immediately after one op is in
> >>>>>>>> flight. Implementations that support this use mode can allow many ops from the same
> >>>>>>>> session.
> >>>>>>>>
> >>>>>>> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If yes, I
> >>>>>>> would still say it would add overhead to PMDs, especially if they are expected to work close to HW (which I
> >>>>>>> think is the case with DPDK PMDs).
> >>>>>>> Though your approach is doable, why can't all this be in a layer above the PMD? i.e. a layer above the PMD
> >>>>>>> can either pass one op at a time with burst size = 1, OR make chained mbufs of input and output and
> >>>>>>> pass that as one op.
> >>>>>>> Is it just to relieve applications of the chained-mbuf burden, or do you see any performance/use-case
> >>>>>>> impacting aspect also?
> >>>>>>>
> >>>>>>> If it is in the context where each op in a burst belongs to a different stream, then why do we need
> >>>>>>> stream_pause and resume? It is an expectation on the app to pass a larger output buffer with consumed + 1
> >>>>>>> from the next call onwards, as it has already seen OUT_OF_SPACE.
> >>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
> >>>>> implementation rejects all ops that belong to a stream that has entered
> >>>>> "RECOVERABLE" state for one reason or another. The caller must
> >>>>> acknowledge explicitly that it has received news of the problem before
> >>>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
> >>>>> that implementing this functionality in the software layer above the PMD
> >>>>> is a bad idea since the latency reductions are lost.
> >>>> [Shally] Just reiterating, I meant it the other way around, i.e. I find it easier to put all such complexity in a
> >>>> layer above the PMD.
> >>>>
> >>>>> This setup is useful in latency sensitive applications where the latency
> >>>>> of buffering multiple ops into one op is significant. We found latency
> >>>>> makes a significant difference in search applications where the PMD
> >>>>> competes with software decompression.
> >>> [Fiona] I see, so when all goes well, you get best-case latency, but when
> >>> out-of-space occurs latency will probably be worse.
> >> [Ahmed] This is exactly right. This use mode assumes out-of-space is a
> >> rare occurrence. Recovering from it should take similar time to
> >> synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
> >> both sync and async use. The caller can fix up the op and send it back
> >> to the PMD to continue work just as would be done in sync. Nonetheless,
> >> the added complexity is not justifiable if out-of-space is very common
> >> since the recoverable state will be the limiting factor that forces
> >> synchronicity.
> >>>>>> [Fiona] I still have concerns with this and would not want to support in our PMD.
> >>>>>> To make sure I understand, you want to send a burst of ops, with several from the same stream.
> >>>>>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
> >>>>>> subsequent ops in that stream.
> >>>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
> >>>>>> Or somehow drop them? How?
> >>>>>> While still processing ops from other streams.
> >>>>> [Ahmed] This is exactly correct. It should return them with
> >>>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
> >>>>>> As we want to offload each op to hardware with as little CPU processing as possible we
> >>>>>> would not want to open up each op to see which stream it's attached to and
> >>>>>> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.
> >>>>> [Ahmed] I think I might have missed your point here, but I will try to
> >>>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
> >>>>> to the PMD and the PMD should reject until stream_continue() is called.
> >>>>> The next op to be sent by the user will have a special marker in it to
> >>>>> inform the PMD to continue working on this stream. Alternatively the
> >>>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
> >>>>> the stream and its state, but like you say this adds additional CPU
> >>>>> overhead during the enqueue.
> >>>>> I am curious: in a simple synchronous use case, how do we prevent users
> >>>>> from putting multiple ops in flight that belong to a single stream? Do
> >>>>> we just currently say it is undefined behavior? Otherwise we would have
> >>>>> to check the stream and incur the CPU overhead.
> >>> [Fiona] We don't do anything to prevent it. It's undefined. IMO, on the data path in the
> >>> DPDK model we expect good behaviour and don't have to error-check for things like this.
> >> [Ahmed] This makes sense. We also assume good behavior.
> >>> In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then
> >>> build and send those messages. If we found an op from a stream which already
> >>> had one inflight, we'd have to hold that back, store in a sw stream-specific holding queue,
> >>> only send 19 to hw. We cannot send multiple ops from same stream to
> >>> the hw as it fans them out and does them in parallel.
> >>> Once the enqueue_burst() returns, there is no processing
> >>> context which would spot that the first has completed
> >>> and send the next op to the hw. On a dequeue_burst() we would spot this,
> >>> in that context could process the next op in the stream.
> >>> On out of space, instead of processing the next op we would have to transfer
> >>> all unprocessed ops from the stream to the dequeue result.
> >>> Some parts of this are doable, but seems likely to add a lot more latency,
> >>> we'd need to add extra threads and timers to move ops from the sw
> >>> queue to the hw q to get any benefit, and these constructs would add
> >>> context switching and CPU cycles. So we prefer to push this responsibility
> >>> above the API, where similar results can be achieved.
> >> [Ahmed] I see what you mean. Our workflow is almost exactly the same
> >> with our hardware, but the fanning out is done by the hardware based on
> >> the stream and ops that belong to the same stream are never allowed to
> >> go out of order. Otherwise the data would be corrupted. Likewise the
> >> hardware is responsible for checking the state of the stream and
> >> returning frames as NOT_PROCESSED to the software.
> >>>>>> Maybe we could add a capability if this behaviour is important for you?
> >>>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
> >>>>>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
> >>>>>> to be in flight at any time.
> >>>>> [Ahmed] That makes sense. This way the different DPDK implementations do
> >>>>> not have to add extra checking for unsupported cases.
> >>>> [Shally] @ahmed, if I summarise your use-case, is this how you want the PMD to support it?
> >>>> - a burst *carries only one stream* and all ops are then assumed to belong to that stream? (Please note,
> >>>> here a burst does not carry more than one stream.)
> >> [Ahmed] No. In this use case the caller sets up an op and enqueues a
> >> single op. Then before the response comes back from the PMD the caller
> >> enqueues a second op on the same stream.
> >>>> -PMD will submit one op at a time to HW?
> >> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the
> >> HW. I used DPDK to mean the software implementation that talks to the
> >> hardware.
> >> The software will submit all ops immediately. The hardware has to figure
> >> out what to do with the ops depending on what stream they belong to.
> >>>> -if processed successfully, push it back to the completion queue with status = SUCCESS. If it fails or runs
> >>>> into OUT_OF_SPACE, then push it to the completion queue with status = FAILURE/
> >>>> OUT_OF_SPACE_RECOVERABLE and the rest with status = NOT_PROCESSED, and return with enqueue count
> >>>> = total # of ops originally submitted in the burst?
> >> [Ahmed] This is exactly what I had in mind. All ops will be submitted to
> >> the HW. The HW will put all of them on the completion queue with the
> >> correct status exactly as you say.
> >>>> -app assumes all have been enqueued, so it goes and dequeues all ops
> >>>> -on seeing an op with OUT_OF_SPACE_RECOVERABLE, app resubmits a burst of ops with a call to
> >>>> stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and the others
> >>>> (NOT_PROCESSED), with updated input and output buffers?
> >> [Ahmed] Correct, this is what we do today in our proprietary API.
> >>>> -repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If at any time
> >>>> failure is seen, does the app start the whole processing all over again or just drop this burst?!
> >> [Ahmed] The app has the choice on how to proceed. If the issue is
> >> recoverable then the application can continue this stream from where it
> >> stopped. If the failure is unrecoverable then the application should
> >> first fix the problem and start from the beginning of the stream.
> >>>> If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream()
> >>>> which will be functional under Feature Flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS or, a better
> >>>> name, SUPPORT_ENQUEUE_SINGLE_STREAM?!
> >> [Ahmed] The main advantage in async use is lost if we force all related
> >> ops to be in the same burst. If we do that, then we might as well merge
> >> all the ops into one op. That would reduce the overhead.
> >> The use mode I am proposing is only useful in cases where the data
> >> becomes available after the first enqueue occurred. I want to allow the
> >> caller to enqueue the second set of data as soon as it is available
> >> regardless of whether or not the HW has already started working on the
> >> first op inflight.
> > [Shally] @ahmed, OK, it seems I missed a point here. So, please confirm the following:
> >
> > As per current description in doc, expected stateful usage is:
> > enqueue (op1) --> dequeue(op1) --> enqueue(op2)
> >
> > but you're suggesting to allow an option to change it to
> >
> > enqueue(op1) -->enqueue(op2)
> >
> > i.e. multiple ops from the same stream can be put in flight via subsequent enqueue_burst() calls without
> > waiting to dequeue previous ones, as long as the PMD supports it. So, no change to the current definition
> > of a burst: it will still carry multiple streams, where each op belongs to a different stream?!
> [Ahmed] Correct. I guess a user could put two ops on the same burst that
> belong to the same stream. In that case it would be more efficient to
> merge the ops using scatter-gather. Nonetheless, I would not add checks
> in my implementation to limit that use. The hardware does not perceive a
> difference between ops that came on one burst and ops that came on two
> different bursts. To the hardware they are all ops. What matters is
> which stream each op belongs to.
> > if yes, then it seems your HW can be set up for multiple streams, so it is efficient for your case to support
> > it in the DPDK PMD layer, but our HW doesn't by default and needs SW to back it. Given that, I also suggest
> > enabling it under some feature flag.
> >
> > However it looks like an add-on, and if it doesn't change the current definition of a burst and the minimum
> > expectation set on stateful processing described in this document, then IMO you can propose this feature
> > as an incremental patch on the baseline version, in the absence of which
> > the application will exercise stateful processing as described here (enq->deq->enq). Thoughts?
> [Ahmed] Makes sense. I was worried that there might be fundamental
> limitations to this mode of use in the API design. That is why I wanted
> to share this use mode with you guys and see if it can be accommodated
> using an incremental patch in the future.
> >>> [Fiona] I'm curious about Ahmed's response to this. I didn't get that a burst should carry only one
> >>> stream, or how this makes a difference, as there can be many enqueue_burst() calls done before a
> >>> dequeue_burst().
> >>> Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all the
> >>> ops had been processed? This would turn it into a synchronous call, which isn't the intent.
> >> [Ahmed] Agreed, a blocking or even a buffering software layer that babysits
> >> the hardware does not fundamentally change the parameters of the
> >> system as a whole. It just moves workflow management complexity down
> >> into the DPDK software layer. Rather there are real latency and
> >> throughput advantages (because of caching) that I want to expose.
> >>
[Fiona] OK, so I think we've agreed that this can be an option, as long as it's not required of
PMDs and is enabled under an explicit capability, named something like
ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS.
@Ahmed, we'll leave it up to you to define the details.
What's necessary is API text to describe the expected behaviour on any error conditions,
the pause/resume API, whether an API is expected to clean up if resume doesn't happen,
and whether there's any time limit on this, etc.
But I wouldn't expect any changes to existing burst APIs, and all PMDs and applications
must be able to handle the default behaviour, i.e. with this capability disabled.
Specifically, even if a PMD has this capability, if an application ignores it and only sends
one op at a time, then when the PMD returns OUT_OF_SPACE_RECOVERABLE the stream should
not be in a paused state, and the PMD should not wait for a resume() to handle the
next op sent for that stream.
Does that make sense?
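A minimal sketch of the default-versus-capability behaviour described above, assuming the stream records at setup time whether multi-op mode is in use. Everything here is hypothetical: the capability bit is only named after the proposal, and `sim_stream_init()`, `sim_submit()` and the `resume` marker (modelling the explicit acknowledgement Ahmed proposed) do not exist in DPDK. It illustrates that OUT_OF_SPACE_RECOVERABLE pauses a stream only when the PMD advertises the capability and the application opted in; otherwise the next op is handled without any resume().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bit, named after the proposal in this thread. */
#define SIM_CAP_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS (1u << 0)

enum op_status { ST_SUCCESS, ST_OUT_OF_SPACE_RECOVERABLE, ST_NOT_PROCESSED };
enum stream_state { STREAM_ACTIVE, STREAM_PAUSED };

struct sim_stream {
    bool multi_op;            /* device has the cap AND the app opted in */
    enum stream_state state;
};

/* Decide once, at stream setup, whether multi-op mode is in use. */
static void sim_stream_init(struct sim_stream *s, unsigned int dev_caps,
                            bool app_wants_multi_op)
{
    s->multi_op = app_wants_multi_op &&
                  (dev_caps & SIM_CAP_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS) != 0;
    s->state = STREAM_ACTIVE;
}

/* Handles one op of the stream. 'fits' says whether its output fits in the
 * destination buffer; 'resume' models the explicit marker an application
 * sets to acknowledge an earlier OUT_OF_SPACE_RECOVERABLE. */
static enum op_status sim_submit(struct sim_stream *s, bool fits, bool resume)
{
    if (s->state == STREAM_PAUSED) {
        if (!resume)
            return ST_NOT_PROCESSED;   /* wait for the app to acknowledge */
        s->state = STREAM_ACTIVE;
    }
    if (!fits) {
        /* Only multi-op streams pause; default streams never wait for resume. */
        if (s->multi_op)
            s->state = STREAM_PAUSED;
        return ST_OUT_OF_SPACE_RECOVERABLE;
    }
    return ST_SUCCESS;
}
```

Deciding the mode once, at stream setup, means a default-mode stream can never enter the paused state, which is the guarantee asked for above, even on a PMD that advertises the capability.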

> >> /// snip ///
> >
> >


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-21 14:35                               ` Trahe, Fiona
@ 2018-02-21 19:35                                 ` Ahmed Mansour
  2018-02-22  4:47                                   ` Verma, Shally
  0 siblings, 1 reply; 30+ messages in thread
From: Ahmed Mansour @ 2018-02-21 19:35 UTC (permalink / raw)
  To: Trahe, Fiona, Verma, Shally, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

On 2/21/2018 9:35 AM, Trahe, Fiona wrote:
> Hi Ahmed, Shally,
>
>
>> -----Original Message-----
>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>> Sent: Tuesday, February 20, 2018 7:56 PM
>> To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona <fiona.trahe@intel.com>; dev@dpdk.org
>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>> Subject: Re: [RFC v2] doc compression API for DPDK
>>
>> /// snip ///
>>>>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>>>>>>> It is always an ideal expectation from the application that it should parse
>>>>>>>>>>>>>>>>> through all related chunks of source data, making its mbuf chain, and enqueue
>>>>>>>>>>>>>>>>> it for stateless processing.
>>>>>>>>>>>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then an
>>>>>>>>>>>>>>>>> expected call flow would be something like:
>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
>>>>>>>>>>>>>>>> burst in a loop until all ops are received. Is this correct?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is
>>>>>>>>>>>>>>> specifically in the context of stateful op processing, to reflect that if a stream is broken into
>>>>>>>>>>>>>>> chunks, then each chunk should be submitted as one op at a time with type = STATEFUL and
>>>>>>>>>>>>>>> needs to be dequeued first before the next chunk is enqueued.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>>>>>>>>> [Ahmed] Why now allow multiple work items in flight? I understand that
>>>>>>>>>>>>>>>> occasionaly there will be OUT_OF_SPACE exception. Can we just
>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>> the response in exception cases?
>>>>>>>>>>>>>>> [Shally] Multiples ops are allowed in flight, however condition is each op in
>>>>>>>>>>>>>> such case is independent of
>>>>>>>>>>>>>>> each other i.e. belong to different streams altogether.
>>>>>>>>>>>>>>> Earlier (as part of RFC v1 doc) we did consider the proposal to process all
>>>>>>>>>>>>>> related chunks of data in single
>>>>>>>>>>>>>>> burst by passing them as ops array but later found that as not-so-useful for
>>>>>>>>>>>>>> PMD handling for various
>>>>>>>>>>>>>>> reasons. You may please refer to RFC v1 doc review comments for same.
>>>>>>>>>>>>>> [Fiona] Agree with Shally. In summary, only one op can be processed at a
>>>>>>>>>>>>>> time, since each needs the state of the previous one. Allowing more than 1 op
>>>>>>>>>>>>>> to be in flight at a time would therefore force PMDs to implement internal queueing and
>>>>>>>>>>>>>> exception handling for the OUT_OF_SPACE conditions you mention.
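The default rule summarised above (at most one in-flight op per stateful stream: enqueue(op1) --> dequeue(op1) --> enqueue(op2)) can be sketched as follows. This is a mock, not the compressdev API. `struct sk_op`, `sk_enqueue_burst()`, `sk_dequeue_burst()` and the flush/status enums are hypothetical stand-ins named after the RFC's enqueue_burst()/dequeue_burst(), NO_FLUSH/FULL_FLUSH and SUCCESS/NOT_PROCESSED. The mock "device" simply marks an op done when it is dequeued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the RFC's op fields; not the real rte_comp API. */
enum sk_flush { SK_NO_FLUSH, SK_FULL_FLUSH };
enum sk_status { SK_NOT_PROCESSED, SK_SUCCESS };

struct sk_op {
    const char *src;
    size_t src_len;
    enum sk_flush flush;
    enum sk_status status;
};

/* Mock queue pair with room for one op: the "device" completes it on dequeue. */
static struct sk_op *sk_inflight;
static enum sk_flush sk_last_flush;

static int sk_enqueue_burst(struct sk_op **ops, int nb_ops)
{
    /* Illustrative check only: per the thread, the data path would not validate this. */
    assert(nb_ops == 1 && sk_inflight == NULL);
    sk_inflight = ops[0];
    return 1;
}

static int sk_dequeue_burst(struct sk_op **ops, int nb_ops)
{
    if (nb_ops < 1 || sk_inflight == NULL)
        return 0;
    sk_inflight->status = SK_SUCCESS;  /* pretend the device processed it */
    sk_last_flush = sk_inflight->flush;
    ops[0] = sk_inflight;
    sk_inflight = NULL;
    return 1;
}

/* Send one stream as chunks: enqueue(op1) -> dequeue(op1) -> enqueue(op2) ...
 * Returns the number of ops used. */
static size_t sk_send_stateful(const char *data, size_t len, size_t chunk)
{
    size_t off = 0, nb_done = 0;

    assert(chunk > 0);
    while (off < len) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        struct sk_op op = {
            .src = data + off,
            .src_len = n,
            .flush = (off + n == len) ? SK_FULL_FLUSH : SK_NO_FLUSH,
            .status = SK_NOT_PROCESSED,
        };
        struct sk_op *in = &op, *out = NULL;

        if (sk_enqueue_burst(&in, 1) != 1)
            break;
        while (sk_dequeue_burst(&out, 1) == 0)
            ;  /* poll: the next chunk waits until this one is dequeued */
        assert(out == &op && out->status == SK_SUCCESS);
        off += n;
        nb_done++;
    }
    return nb_done;
}
```

For example, 10 bytes sent in chunks of 4 take three ops: two with NO_FLUSH and a final one with FULL_FLUSH.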
>>>>>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
>>>>>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>>>>>>>>>>> complex but doable.
>>>>>>>>>>> [Fiona] In my opinion this is not doable, and it could be very inefficient.
>>>>>>>>>>> There may be many streams.
>>>>>>>>>>> The PMD would have to have an internal queue per stream so
>>>>>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>>>>>>>>>>> And this may ripple back through all subsequent ops in the stream, as each
>>>>>>>>>>> source len is increased and its dst buffer is not big enough.
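The ripple effect described above can be sketched under one interpretation: the unconsumed tail of the op that hit OUT_OF_SPACE is folded into the next op of the same stream. This is only a sketch. `struct sk_op` and `sk_fold_remainder()` are hypothetical. The sketch assumes each op covers an adjacent window of one contiguous input, and that `consumed` is the number of input bytes the device reports before running out of output space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op view: the window [src_offset, src_offset + src_len) of one
 * stream's contiguous input, plus the input bytes the device reports consumed. */
struct sk_op {
    size_t src_offset;
    size_t src_len;
    size_t consumed;
};

/* Fold the unconsumed tail of `failed` (which hit OUT_OF_SPACE) into `next`,
 * the following op of the same stream. Assumes the two windows are adjacent. */
static void sk_fold_remainder(const struct sk_op *failed, struct sk_op *next)
{
    size_t left;

    assert(failed->consumed <= failed->src_len);
    assert(next->src_offset == failed->src_offset + failed->src_len);
    left = failed->src_len - failed->consumed;
    next->src_offset -= left;  /* restart where the device stopped */
    next->src_len += left;     /* larger input: its dst buffer may now be too small */
}
```

If the enlarged op then also runs out of space, the same fix-up has to be applied to the op after it, and so on down the stream. That cascade is the per-stream work this exchange is about.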
>>>>>>>>>> [Ahmed] Regarding multi-op OUT_OF_SPACE handling:
>>>>>>>>>> The caller would still need to adjust the src length/output buffer as you
>>>>>>>>>> say. The PMD cannot handle OUT_OF_SPACE internally.
>>>>>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
>>>>>>>>>> until it gets explicit confirmation from the caller to continue working on
>>>>>>>>>> this stream. Any ops received by the PMD should be returned to the caller
>>>>>>>>>> with status STREAM_PAUSED, since the caller did not explicitly acknowledge
>>>>>>>>>> that it has solved the OUT_OF_SPACE issue.
>>>>>>>>>> These semantics can be enabled by adding a new function to the API,
>>>>>>>>>> perhaps stream_resume(). This allows the caller to acknowledge that it has
>>>>>>>>>> seen the issue and that this op should be used to resolve it.
>>>>>>>>>> Implementations that do not support this mode of use can push back
>>>>>>>>>> immediately after one op is in flight. Implementations that support this
>>>>>>>>>> use mode can allow many ops from the same session.
>>>>>>>>>>
>>>>>>>>> [Shally] Is this still in the context of a single burst where all ops belong to one
>>>>>>>>> stream? If yes, I would still say it would add overhead to PMDs, especially if a
>>>>>>>>> PMD is expected to work close to HW (which I think is the case with DPDK PMDs).
>>>>>>>>> Your approach is doable, but why can't all this be in a layer above the PMD?
>>>>>>>>> That is, a layer above the PMD can either pass one op at a time with burst size = 1
>>>>>>>>> OR make chained mbufs of input and output and pass them as one op.
>>>>>>>>> Is it just to relieve applications of the chained-mbuf burden, or do you also
>>>>>>>>> see a performance/use-case impact?
>>>>>>>>>
>>>>>>>>> If it is in the context where each op in a burst belongs to a different stream,
>>>>>>>>> then why do we need stream_pause and resume? The app is expected to pass more
>>>>>>>>> output buffer with consumed + 1 from the next call onwards, as it has already
>>>>>>>>> seen OUT_OF_SPACE.
>>>>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
>>>>>>> implementation rejects all ops that belong to a stream that has entered
>>>>>>> "RECOVERABLE" state for one reason or another. The caller must
>>>>>>> acknowledge explicitly that it has received news of the problem before
>>>>>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
>>>>>>> that implementing this functionality in the software layer above the PMD
>>>>>>> is a bad idea since the latency reductions are lost.
>>>>>> [Shally] Just to reiterate, I meant the other way around, i.e. I see it as easier
>>>>>> to put all such complexity in a layer above the PMD.
>>>>>>
>>>>>>> This setup is useful in latency sensitive applications where the latency
>>>>>>> of buffering multiple ops into one op is significant. We found latency
>>>>>>> makes a significant difference in search applications where the PMD
>>>>>>> competes with software decompression.
>>>>> [Fiona] I see, so when all goes well, you get best-case latency, but when
>>>>> out-of-space occurs latency will probably be worse.
>>>> [Ahmed] This is exactly right. This use mode assumes out-of-space is a
>>>> rare occurrence. Recovering from it should take similar time to
>>>> synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
>>>> both sync and async use. The caller can fix up the op and send it back
>>>> to the PMD to continue work just as would be done in sync. Nonetheless,
>>>> the added complexity is not justifiable if out-of-space is very common
>>>> since the recoverable state will be the limiting factor that forces
>>>> synchronicity.
>>>>>>>> [Fiona] I still have concerns with this and would not want to support it in our PMD.
>>>>>>>> To make sure I understand: you want to send a burst of ops, with several from the
>>>>>>>> same stream. If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not
>>>>>>>> process any subsequent ops in that stream.
>>>>>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
>>>>>>>> Or somehow drop them? How?
>>>>>>>> And all this while still processing ops from other streams?
>>>>>>> [Ahmed] This is exactly correct. It should return them with
>>>>>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>>>>>>>> As we want to offload each op to hardware with as little CPU processing as possible we
>>>>>>>> would not want to open up each op to see which stream it's attached to and
>>>>>>>> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue
>>>>>>>> without processing.
>>>>>>> [Ahmed] I think I might have missed your point here, but I will try to
>>>>>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
>>>>>>> to the PMD and the PMD should reject until stream_continue() is called.
>>>>>>> The next op to be sent by the user will have a special marker in it to
>>>>>>> inform the PMD to continue working on this stream. Alternatively the
>>>>>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
>>>>>>> the stream and its state, but like you say this adds additional CPU
>>>>>>> overhead during the enqueue.
>>>>>>> I am curious: in a simple synchronous use case, how do we prevent users
>>>>>>> from putting multiple ops in flight that belong to a single stream? Do
>>>>>>> we just currently say it is undefined behavior? Otherwise we would have
>>>>>>> to check the stream and incur the CPU overhead.
>>>>> [Fiona] We don't do anything to prevent it. It's undefined. IMO on data path in
>>>>> DPDK model we expect good behaviour and don't have to error check for things like this.
>>>> [Ahmed] This makes sense. We also assume good behavior.
>>>>> In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then
>>>>> build and send those messages. If we found an op from a stream which already
>>>>> had one inflight, we'd have to hold that back, store in a sw stream-specific holding queue,
>>>>> only send 19 to hw. We cannot send multiple ops from same stream to
>>>>> the hw as it fans them out and does them in parallel.
>>>>> Once the enqueue_burst() returns, there is no processing
>>>>> context which would spot that the first has completed
>>>>> and send the next op to the hw. On a dequeue_burst() we would spot this,
>>>>> in that context could process the next op in the stream.
>>>>> On out of space, instead of processing the next op we would have to transfer
>>>>> all unprocessed ops from the stream to the dequeue result.
>>>>> Some parts of this are doable, but seems likely to add a lot more latency,
>>>>> we'd need to add extra threads and timers to move ops from the sw
>>>>> queue to the hw q to get any benefit, and these constructs would add
>>>>> context switching and CPU cycles. So we prefer to push this responsibility
>>>>> above the API, where similar results can be achieved.
>>>> [Ahmed] I see what you mean. Our workflow is almost exactly the same
>>>> with our hardware, but the fanning out is done by the hardware based on
>>>> the stream and ops that belong to the same stream are never allowed to
>>>> go out of order. Otherwise the data would be corrupted. Likewise the
>>>> hardware is responsible for checking the state of the stream and
>>>> returning frames as NOT_PROCESSED to the software.
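The software bookkeeping described above can be sketched as follows. This is the work a PMD would need if several ops of one stream could arrive while its hardware must only see one at a time. Everything here is hypothetical. Stream ids stand in for ops, `sk_enqueue_burst()` partitions a burst between the hw queue and a software holding queue, and `sk_on_complete()` is the dequeue-time step that releases the next held op. The sketch does not cover the OUT_OF_SPACE case, where all held ops of the stream would instead have to be moved to the dequeue result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_STREAMS 64
#define SK_MAX_HELD 256

/* Hypothetical PMD-side bookkeeping; stream ids stand in for ops. */
struct sk_pmd {
    bool stream_busy[SK_MAX_STREAMS];  /* an op of this stream is on the hw queue */
    int held[SK_MAX_HELD];             /* sw holding queue, FIFO of stream ids */
    size_t nb_held;
};

/* Enqueue side: returns how many ops go to hw now; the rest are held in sw. */
static size_t sk_enqueue_burst(struct sk_pmd *p, const int *stream_ids, size_t nb_ops)
{
    size_t to_hw = 0;

    for (size_t i = 0; i < nb_ops; i++) {
        int sid = stream_ids[i];

        assert(sid >= 0 && sid < SK_MAX_STREAMS);
        if (p->stream_busy[sid]) {
            assert(p->nb_held < SK_MAX_HELD);
            p->held[p->nb_held++] = sid;  /* must wait for its predecessor */
        } else {
            p->stream_busy[sid] = true;
            to_hw++;
        }
    }
    return to_hw;
}

/* Dequeue side: an op of stream `sid` completed. If an op of the same stream is
 * held, move it to hw (returns true); otherwise the stream becomes idle. */
static bool sk_on_complete(struct sk_pmd *p, int sid)
{
    for (size_t i = 0; i < p->nb_held; i++) {
        if (p->held[i] == sid) {
            for (size_t j = i + 1; j < p->nb_held; j++)
                p->held[j - 1] = p->held[j];  /* keep FIFO order */
            p->nb_held--;
            return true;  /* stream stays busy with its next op */
        }
    }
    p->stream_busy[sid] = false;
    return false;
}
```

With a burst of 20 ops where one stream appears twice, 19 go to hw and 1 is held. The held op moves on only when a dequeue calls `sk_on_complete()` for that stream. This matches the point that no processing context exists between enqueue_burst() and dequeue_burst().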
>>>>>>>> Maybe we could add a capability if this behaviour is important for you?
>>>>>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
>>>>>>>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
>>>>>>>> to be in flight at any time.
>>>>>>> [Ahmed] That makes sense. This way the different DPDK implementations do
>>>>>>> not have to add extra checking for unsupported cases.
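An application could honour such a capability with a per-stream in-flight limit, sketched below. The flag name follows the proposal in this thread. `SK_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS`, its bit value, and both functions are hypothetical, not part of any released compressdev API. Without the flag the app falls back to enqueue(op1) --> dequeue(op1) --> enqueue(op2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit: the name follows the thread's proposal, the value is
 * arbitrary and not a real compressdev flag. */
#define SK_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS (UINT64_C(1) << 0)

/* Max ops of one stateful stream the app may keep in flight on this device. */
static unsigned sk_max_inflight_per_stream(uint64_t dev_flags, unsigned wanted)
{
    if (wanted > 1 && (dev_flags & SK_FF_ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS))
        return wanted;  /* enqueue(op1) --> enqueue(op2) ... */
    return 1;           /* default: enqueue(op1) --> dequeue(op1) --> enqueue(op2) */
}

/* App-side gate evaluated before enqueuing the next op of a stream. */
static bool sk_may_enqueue(unsigned inflight, uint64_t dev_flags, unsigned wanted)
{
    unsigned limit = sk_max_inflight_per_stream(dev_flags, wanted);

    assert(inflight <= limit);  /* a well-behaved app never exceeds its own limit */
    return inflight < limit;
}
```

Because the check lives in the application, a PMD that leaves the flag unset needs no per-op stream inspection on its data path.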
>>>>>> [Shally] @ahmed, if I summarise your use case, is this how you want the PMD to support it?
>>>>>> - a burst *carries only one stream* and all ops are then assumed to belong to that
>>>>>> stream? (please note, here a burst does not carry more than one stream)
>>>> [Ahmed] No. In this use case the caller sets up an op and enqueues a
>>>> single op. Then before the response comes back from the PMD the caller
>>>> enqueues a second op on the same stream.
>>>>>> -PMD will submit one op at a time to HW?
>>>> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the
>>>> HW. I used DPDK to mean the software implementation that talks to the
>>>> hardware.
>>>> The software will submit all ops immediately. The hardware has to figure
>>>> out what to do with the ops depending on what stream they belong to.
>>>>>> -if processed successfully, push it back to the completion queue with status = SUCCESS.
>>>>>> If it fails or runs into OUT_OF_SPACE, push it to the completion queue with status =
>>>>>> FAILURE/OUT_OF_SPACE_RECOVERABLE and the rest with status = NOT_PROCESSED, and return
>>>>>> with enqueue count = total # of ops originally submitted with the burst?
>>>> [Ahmed] This is exactly what I had in mind. All ops will be submitted to
>>>> the HW. The HW will put all of them on the completion queue with the
>>>> correct status exactly as you say.
>>>>>> -app assumes all have been enqueued, so it goes and dequeues all ops
>>>>>> -on seeing an op with OUT_OF_SPACE_RECOVERABLE, the app resubmits a burst of ops with a call to
>>>>>> the stream_continue/resume API. The burst starts from the op which encountered OUT_OF_SPACE,
>>>>>> followed by the others marked NOT_PROCESSED, with updated input and output buffers?
>>>> [Ahmed] Correct, this is what we do today in our proprietary API.
>>>>>> -repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If a
>>>>>> failure is seen at any time, does the app start the whole processing over again or just drop this burst?!
>>>> [Ahmed] The app has the choice on how to proceed. If the issue is
>>>> recoverable then the application can continue this stream from where it
>>>> stopped. If the failure is unrecoverable then the application should
>>>> first fix the problem and start from the beginning of the stream.
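The application-side decision in the exchange above can be sketched as a scan over the dequeued statuses of one stream's ops, in submission order. The enum and function names are hypothetical and only mirror the statuses used in this thread. Ahmed leaves the reaction to FAILURE to the application, and restarting the stream is just the option he describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status values mirroring the thread; not the real rte_comp enum. */
enum sk_st { SK_SUCCESS, SK_NOT_PROCESSED, SK_OUT_OF_SPACE_RECOVERABLE, SK_FAILURE };

enum sk_action { SK_DONE, SK_RESUME_FROM, SK_RESTART_STREAM };

/* Given the dequeued statuses of one stream's ops, in submission order, decide
 * what the app does next. *idx receives the first op to resubmit. */
static enum sk_action sk_next_action(const enum sk_st *st, size_t n, size_t *idx)
{
    assert(idx != NULL);
    for (size_t i = 0; i < n; i++) {
        switch (st[i]) {
        case SK_SUCCESS:
            continue;
        case SK_OUT_OF_SPACE_RECOVERABLE:
            *idx = i;  /* fix buffers of op i, resume, then resend ops i..n-1 */
            return SK_RESUME_FROM;
        case SK_FAILURE:
            *idx = 0;  /* unrecoverable: fix the cause, start the stream over */
            return SK_RESTART_STREAM;
        case SK_NOT_PROCESSED:
            *idx = i;  /* not expected before an error op; resend from here */
            return SK_RESUME_FROM;
        }
    }
    *idx = n;
    return SK_DONE;
}
```

For statuses {SUCCESS, SUCCESS, OUT_OF_SPACE_RECOVERABLE, NOT_PROCESSED, NOT_PROCESSED} the app would resubmit ops 2..4, starting with the op that ran out of space.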
>>>>>> If all of the above is true, then I think we should add another API such as
>>>>>> rte_comp_enque_single_stream() which will be functional under Feature Flag =
>>>>>> ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS, or a better name, SUPPORT_ENQUEUE_SINGLE_STREAM?!
>>>> [Ahmed] The main advantage in async use is lost if we force all related
>>>> ops to be in the same burst. If we do that, then we might as well merge
>>>> all the ops into one op. That would reduce the overhead.
>>>> The use mode I am proposing is only useful in cases where the data
>>>> becomes available after the first enqueue occurred. I want to allow the
>>>> caller to enqueue the second set of data as soon as it is available
>>>> regardless of whether or not the HW has already started working on the
>>>> first op inflight.
>>> [Shally] @ahmed, OK, it seems I missed a point here. So, please confirm the following:
>>>
>>> As per the current description in the doc, the expected stateful usage is:
>>> enqueue(op1) --> dequeue(op1) --> enqueue(op2)
>>>
>>> but you're suggesting adding an option to change it to
>>>
>>> enqueue(op1) --> enqueue(op2)
>>>
>>> i.e. multiple ops from the same stream can be put in flight via subsequent enqueue_burst() calls
>>> without waiting to dequeue previous ones, as long as the PMD supports it. So, no change to the
>>> current definition of a burst: it will still carry multiple streams, where each op belongs to a
>>> different stream?!
>> [Ahmed] Correct. I guess a user could put two ops on the same burst that
>> belong to the same stream. In that case it would be more efficient to
>> merge the ops using scatter gather. Nonetheless, I would not add checks
>> in my implementation to limit that use. The hardware does not perceive a
>> difference between ops that came on one burst and ops that came on two
>> different bursts. To the hardware they are all ops. What matters is
>> which stream each op belongs to.
>>> If yes, then it seems your HW can be set up for multiple streams, so it is efficient for your case
>>> to support it in the DPDK PMD layer. Our HW doesn't do this by default and needs SW to back it.
>>> Given that, I also suggest enabling it under some feature flag.
>>> However, it looks like an add-on. If it doesn't change the current definition of a burst or
>>> the minimum expectation set on stateful processing described in this document, then IMO you can
>>> propose this feature as an incremental patch on the baseline version. In its absence,
>>> applications will exercise stateful processing as described here (enq->deq->enq). Thoughts?
>> [Ahmed] Makes sense. I was worried that there might be fundamental
>> limitations to this mode of use in the API design. That is why I wanted
>> to share this use mode with you guys and see if it can be accommodated
>> using an incremental patch in the future.
>>>>> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one stream,
>>>>> or see how this makes a difference, as there can be many enqueue_burst() calls done before a dequeue_burst().
>>>>> Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all the ops
>>>>> had been processed? This would turn it into a synchronous call, which isn't the intent.
>>>> [Ahmed] Agreed, a blocking or even a buffering software layer that
>>>> babysits the hardware does not fundamentally change the parameters of the
>>>> system as a whole. It just moves workflow management complexity down
>>>> into the DPDK software layer. Rather, there are real latency and
>>>> throughput advantages (because of caching) that I want to expose.
>>>>
> [Fiona] OK, so I think we've agreed that this can be an option, as long as it is not required of
> PMDs and is enabled under an explicit capability, named something like
> ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS.
> @Ahmed, we'll leave it up to you to define the details.
> What's necessary is API text to describe the expected behaviour on any error conditions,
> the pause/resume API, whether an API is expected to clean up if resume doesn't happen,
> whether there's any time limit on this, etc.
> But I wouldn't expect any changes to existing burst APIs, and all PMDs and applications
> must be able to handle the default behaviour, i.e. with this capability disabled.
> Specifically, even if a PMD has this capability, consider an application that ignores it and only sends
> one op at a time. If the PMD returns OUT_OF_SPACE_RECOVERABLE, the stream should
> not be in a paused state, and the PMD should not wait for a resume() to handle the
> next op sent for that stream.
> Does that make sense?
[Ahmed] That makes sense. When this mode is enabled, additional
functions must be called to resume the work, even if only one op was in
flight. When this mode is not enabled, the PMD assumes that the
caller will never enqueue a stateful op before receiving a response to
the one that precedes it in the stream.
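The behaviour agreed here can be sketched as a small per-stream state machine on the device side. All names are hypothetical. `multi_op_mode` models an application that opted in to the proposed ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS behaviour. `sk_stream_resume()` stands in for the pause/resume API that was still to be defined, so its name and signature are not settled by this thread. `needed` and `dst_space` are mock sizes used to trigger OUT_OF_SPACE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sk_status { SK_NOT_PROCESSED, SK_SUCCESS, SK_OUT_OF_SPACE_RECOVERABLE };

/* Hypothetical per-stream state. `multi_op_mode` models an app that opted in to
 * the ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS behaviour on a PMD that offers it. */
struct sk_stream {
    bool multi_op_mode;
    bool paused;
};

/* Hypothetical stream_resume(): the caller acknowledges the OUT_OF_SPACE. */
static void sk_stream_resume(struct sk_stream *s)
{
    s->paused = false;
}

/* Device-side handling of the next op of a stream. `needed` is the output the
 * op would produce and `dst_space` the output space it was given (both mocks). */
static enum sk_status sk_process_op(struct sk_stream *s, size_t needed, size_t dst_space)
{
    assert(s != NULL);
    if (s->paused)
        return SK_NOT_PROCESSED;       /* held until sk_stream_resume() */
    if (needed > dst_space) {
        s->paused = s->multi_op_mode;  /* pause only if the app opted in */
        return SK_OUT_OF_SPACE_RECOVERABLE;
    }
    return SK_SUCCESS;
}
```

In default mode an OUT_OF_SPACE_RECOVERABLE leaves the stream usable, and the next op is handled without any resume. In the opt-in mode, later ops come back NOT_PROCESSED until the caller resumes the stream.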
>
>>>> /// snip ///
>>>
>



* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-21 19:35                                 ` Ahmed Mansour
@ 2018-02-22  4:47                                   ` Verma, Shally
  2018-02-22 19:35                                     ` Ahmed Mansour
  0 siblings, 1 reply; 30+ messages in thread
From: Verma, Shally @ 2018-02-22  4:47 UTC (permalink / raw)
  To: Ahmed Mansour, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry



>-----Original Message-----
>From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>Sent: 22 February 2018 01:06
>To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
><Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
><Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>Subject: Re: [RFC v2] doc compression API for DPDK
>
>On 2/21/2018 9:35 AM, Trahe, Fiona wrote:
>> Hi Ahmed, Shally,
>>
>>
>>> -----Original Message-----
>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>>> Sent: Tuesday, February 20, 2018 7:56 PM
>>> To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona <fiona.trahe@intel.com>; dev@dpdk.org
>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
>>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>
>>> /// snip ///
>>>>>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>>>>>>>> It is always an ideal expectation from application that it should parse
>>>>>>>>>>>>>>>>> through all related chunk of source data making its mbuf-chain and
>>>>>>>>>>>>>>> enqueue
>>>>>>>>>>>>>>>>> it for stateless processing.
>>>>>>>>>>>>>>>>>> However, if it need to break it into several enqueue_burst() calls, then
>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>> expected call flow would be something like:
>>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD.The user will call dequeue
>>>>>>>>>>>>>>>>> burst in a loop until all ops are received. Is this correct?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op need to be dequeued. However
>>>>>>>>>>>>>>> this illustration is specifically in
>>>>>>>>>>>>>>>> context of stateful op processing to reflect if a stream is broken into
>>>>>>>>>>>>>>> chunks, then each chunk should be
>>>>>>>>>>>>>>>> submitted as one op at-a-time with type = STATEFUL and need to be
>>>>>>>>>>>>>>> dequeued first before next chunk is
>>>>>>>>>>>>>>>> enqueued.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>>>>>>>>>> [Ahmed] Why now allow multiple work items in flight? I understand that
>>>>>>>>>>>>>>>>> occasionaly there will be OUT_OF_SPACE exception. Can we just
>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>> the response in exception cases?
>>>>>>>>>>>>>>>> [Shally] Multiples ops are allowed in flight, however condition is each op in
>>>>>>>>>>>>>>> such case is independent of
>>>>>>>>>>>>>>>> each other i.e. belong to different streams altogether.
>>>>>>>>>>>>>>>> Earlier (as part of RFC v1 doc) we did consider the proposal to process all
>>>>>>>>>>>>>>> related chunks of data in single
>>>>>>>>>>>>>>>> burst by passing them as ops array but later found that as not-so-useful for
>>>>>>>>>>>>>>> PMD handling for various
>>>>>>>>>>>>>>>> reasons. You may please refer to RFC v1 doc review comments for same.
>>>>>>>>>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
>>>>>>>>>>>>>>> time, since each needs the
>>>>>>>>>>>>>>> state of the previous, to allow more than 1 op to be in-flight at a time would
>>>>>>>>>>>>>>> force PMDs to implement internal queueing and exception handling for
>>>>>>>>>>>>>>> OUT_OF_SPACE conditions you mention.
>>>>>>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
>>>>>>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>>>>>>>>>>>> complex but doable.
>>>>>>>>>>>> [Fiona] In my opinion this is not doable, could be very inefficient.
>>>>>>>>>>>> There may be many streams.
>>>>>>>>>>>> The PMD would have to have an internal queue per stream so
>>>>>>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>>>>>>>>>>>> And this may ripple back though all subsequent ops in the stream as each
>>>>>>>>>>>> source len is increased and its dst buffer is not big enough.
>>>>>>>>>>> [Ahmed] Regarding multi op OUT_OF_SPACE handling.
>>>>>>>>>>> The caller would still need to adjust
>>>>>>>>>>> the src length/output buffer as you say. The PMD cannot handle
>>>>>>>>>>> OUT_OF_SPACE internally.
>>>>>>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
>>>>>>>>>>> until it gets explicit
>>>>>>>>>>> confirmation from the caller to continue working on this stream. Any ops
>>>>>>>>>>> received by
>>>>>>>>>>> the PMD should be returned to the caller with status STREAM_PAUSED since
>>>>>>>>>>> the caller did not
>>>>>>>>>>> explicitly acknowledge that it has solved the OUT_OF_SPACE issue.
>>>>>>>>>>> These semantics can be enabled by adding a new function to the API
>>>>>>>>>>> perhaps stream_resume().
>>>>>>>>>>> This allows the caller to indicate that it acknowledges that it has seen
>>>>>>>>>>> the issue and this op
>>>>>>>>>>> should be used to resolve the issue. Implementations that do not support
>>>>>>>>>>> this mode of use
>>>>>>>>>>> can push back immediately after one op is in flight. Implementations
>>>>>>>>>>> that support this use
>>>>>>>>>>> mode can allow many ops from the same session
>>>>>>>>>>>
>>>>>>>>>> [Shally] Is it still in context of having single burst where all op belongs to one stream? If yes, I
>>> would
>>>>>>> still
>>>>>>>>>> say it would add an overhead to PMDs especially if it is expected to work closer to HW (which I
>>> think
>>>>>>> is
>>>>>>>>>> the case with DPDK PMD).
>>>>>>>>>> Though your approach is doable but why this all cannot be in a layer above PMD? i.e. a layer
>>> above
>>>>>>> PMD
>>>>>>>>>> can either pass one-op at a time with burst size = 1 OR can make chained mbuf of input and
>>> output
>>>>>>> and
>>>>>>>>>> pass than as one op.
>>>>>>>>>> Is it just to ease applications of chained mbuf burden or do you see any performance /use-case
>>>>>>>>>> impacting aspect also?
>>>>>>>>>>
>>>>>>>>>> if it is in context where each op belong to different stream in a burst, then why do we need
>>>>>>>>>> stream_pause and resume? It is a expectations from app to pass more output buffer with
>>> consumed
>>>>>>> + 1
>>>>>>>>>> from next call onwards as it has already
>>>>>>>>>> seen OUT_OF_SPACE.
>>>>>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
>>>>>>>> implementation rejects all ops that belong to a stream that has entered
>>>>>>>> "RECOVERABLE" state for one reason or another. The caller must
>>>>>>>> acknowledge explicitly that it has received news of the problem before
>>>>>>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
>>>>>>>> that implementing this functionality in the software layer above the PMD
>>>>>>>> is a bad idea since the latency reductions are lost.
>>>>>>> [Shally] Just reiterating, I rather meant other way around i.e. I see it easier to put all such complexity
>>> in a
>>>>>>> layer above PMD.
>>>>>>>
>>>>>>>> This setup is useful in latency sensitive applications where the latency
>>>>>>>> of buffering multiple ops into one op is significant. We found latency
>>>>>>>> makes a significant difference in search applications where the PMD
>>>>>>>> competes with software decompression.
>>>>>> [Fiona] I see, so when all goes well, you get best-case latency, but when
>>>>>> out-of-space occurs latency will probably be worse.
>>>>> [Ahmed] This is exactly right. This use mode assumes out-of-space is a
>>>>> rare occurrence. Recovering from it should take similar time to
>>>>> synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
>>>>> both sync and async use. The caller can fix up the op and send it back
>>>>> to the PMD to continue work just as would be done in sync. Nonetheless,
>>>>> the added complexity is not justifiable if out-of-space is very common
>>>>> since the recoverable state will be the limiting factor that forces
>>>>> synchronicity.
>>>>>>>>> [Fiona] I still have concerns with this and would not want to support in our PMD.
>>>>>>>>> TO make sure I understand, you want to send a burst of ops, with several from same stream.
>>>>>>>>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
>>>>>>>>> subsequent ops in that stream.
>>>>>>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
>>>>>>>>> Or somehow drop them? How?
>>>>>>>>> While still processing ops form other streams.
>>>>>>>> [Ahmed] This is exactly correct. It should return them with
>>>>>>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>>>>>>>>> As we want to offload each op to hardware with as little CPU processing as possible we
>>>>>>>>> would not want to open up each op to see which stream it's attached to and
>>>>>>>>> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without
>>> processing.
>>>>>>>> [Ahmed] I think I might have missed your point here, but I will try to
>>>>>>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
>>>>>>>> to the PMD and the PMD should reject until stream_continue() is called.
>>>>>>>> The next op to be sent by the user will have a special marker in it to
>>>>>>>> inform the PMD to continue working on this stream. Alternatively the
>>>>>>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
>>>>>>>> the stream and its state, but like you say this adds additional CPU
>>>>>>>> overhead during the enqueue.
>>>>>>>> I am curious. In a simple synchronous use case. How do we prevent users
>>>>>>> >from putting multiple ops in flight that belong to a single stream? Do
>>>>>>>> we just currently say it is undefined behavior? Otherwise we would have
>>>>>>>> to check the stream and incur the CPU overhead.
>>>>>> [Fiona] We don't do anything to prevent it. It's undefined. IMO on data path in
>>>>>> DPDK model we expect good behaviour and don't have to error check for things like this.
>>>>> [Ahmed] This makes sense. We also assume good behavior.
>>>>>> In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then
>>>>>> build and send those messages. If we found an op from a stream which already
>>>>>> had one inflight, we'd have to hold that back, store in a sw stream-specific holding queue,
>>>>>> only send 19 to hw. We cannot send multiple ops from same stream to
>>>>>> the hw as it fans them out and does them in parallel.
>>>>>> Once the enqueue_burst() returns, there is no processing
>>>>>> context which would spot that the first has completed
>>>>>> and send the next op to the hw. On a dequeue_burst() we would spot this,
>>>>>> in that context could process the next op in the stream.
>>>>>> On out of space, instead of processing the next op we would have to transfer
>>>>>> all unprocessed ops from the stream to the dequeue result.
>>>>>> Some parts of this are doable, but seems likely to add a lot more latency,
>>>>>> we'd need to add extra threads and timers to move ops from the sw
>>>>>> queue to the hw q to get any benefit, and these constructs would add
>>>>>> context switching and CPU cycles. So we prefer to push this responsibility
>>>>>> to above the API and it can achieve similar.
>>>>> [Ahmed] I see what you mean. Our workflow is almost exactly the same
>>>>> with our hardware, but the fanning out is done by the hardware based on
>>>>> the stream and ops that belong to the same stream are never allowed to
>>>>> go out of order. Otherwise the data would be corrupted. Likewise the
>>>>> hardware is responsible for checking the state of the stream and
>>>>> returning frames as NOT_PROCESSED to the software
>>>>>>>>> Maybe we could add a capability if this behaviour is important for you?
>>>>>>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
>>>>>>>>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
>>>>>>>>> to be in flight at any time.
>>>>>>>> [Ahmed] That makes sense. This way the different DPDK implementations do
>>>>>>>> not have to add extra checking for unsupported cases.
>>>>>>> [Shally] @ahmed, If I summarise your use-case, this is how to want to PMD to support?
>>>>>>> - a burst *carry only one stream* and all ops then assumed to be belong to that stream? (please
>>> note,
>>>>>>> here burst is not carrying more than one stream)
>>>>> [Ahmed] No. In this use case the caller sets up an op and enqueues a
>>>>> single op. Then before the response comes back from the PMD the caller
>>>>> enqueues a second op on the same stream.
>>>>>>> -PMD will submit one op at a time to HW?
>>>>> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the
>>>>> HW. I used DPDK to mean the software implementation that talks to the
>>>>> hardware.
>>>>> The software will submit all ops immediately. The hardware has to figure
>>>>> out what to do with the ops depending on what stream they belong to.
>>>>>>> -if processed successfully, push it back to completion queue with status = SUCCESS. If failed or run
>>>>>>> into OUT_OF_SPACE, then push it to completion queue with status = FAILURE/
>>>>>>> OUT_OF_SPACE_RECOVERABLE and rest with status = NOT_PROCESSED and return with enqueue
>>>>>>> count = total # of ops submitted originally with burst?
>>>>> [Ahmed] This is exactly what I had in mind. All ops will be submitted to
>>>>> the HW. The HW will put all of them on the completion queue with the
>>>>> correct status exactly as you say.
>>>>>>> -app assumes all have been enqueued, so it goes and dequeues all ops
>>>>>>> -on seeing an op with OUT_OF_SPACE_RECOVERABLE, app resubmits a burst of ops with a call to the
>>>>>>> stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and others as
>>>>>>> NOT_PROCESSED, with updated input and output buffers?
>>>>> [Ahmed] Correct, this is what we do today in our proprietary API.
>>>>>>> -repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If at any time
>>>>>>> a failure is seen, does the app start the whole processing all over again or just drop this burst?!
>>>>> [Ahmed] The app has the choice on how to proceed. If the issue is
>>>>> recoverable then the application can continue this stream from where it
>>>>> stopped. If the failure is unrecoverable then the application should
>>>>> first fix the problem and start from the beginning of the stream.
>>>>>>> If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream()
>>>>>>> which will be functional under Feature Flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS or, a better
>>>>>>> name, SUPPORT_ENQUEUE_SINGLE_STREAM?!
>>>>> [Ahmed] The main advantage in async use is lost if we force all related
>>>>> ops to be in the same burst. If we do that, then we might as well merge
>>>>> all the ops into one op. That would reduce the overhead.
>>>>> The use mode I am proposing is only useful in cases where the data
>>>>> becomes available after the first enqueue occurred. I want to allow the
>>>>> caller to enqueue the second set of data as soon as it is available,
>>>>> regardless of whether or not the HW has already started working on the
>>>>> first op in flight.
>>>> [Shally] @ahmed, OK, it seems I missed a point here. So, please confirm the following:
>>>>
>>>> As per the current description in the doc, the expected stateful usage is:
>>>> enqueue(op1) --> dequeue(op1) --> enqueue(op2)
>>>>
>>>> but you're suggesting to allow an option to change it to
>>>>
>>>> enqueue(op1) --> enqueue(op2)
>>>>
>>>> i.e. multiple ops from the same stream can be put in flight via subsequent enqueue_burst() calls
>>>> without waiting to dequeue previous ones, as long as the PMD supports it. So, no change to the current
>>>> definition of a burst: it will still carry multiple streams, with each op belonging to a different stream?!
>>> [Ahmed] Correct. I guess a user could put two ops on the same burst that
>>> belong to the same stream. In that case it would be more efficient to
>>> merge the ops using scatter gather. Nonetheless, I would not add checks
>>> in my implementation to limit that use. The hardware does not perceive a
>>> difference between ops that came on one burst and ops that came on two
>>> different bursts. To the hardware they are all ops. What matters is
>>> which stream each op belongs to.
>>>> If yes, then it seems your HW can be set up for multiple streams, so it is efficient for your case to
>>>> support it in the DPDK PMD layer, but our HW doesn't by default and needs SW to back it. Given that,
>>>> I also suggest enabling it under some feature flag.
>>>> However, it looks like an add-on, and if it doesn't change the current definition of a burst and the
>>>> minimum expectation set on stateful processing described in this document, then IMO you can propose
>>>> this feature as an incremental patch on the baseline version, in the absence of which the
>>>> application will exercise stateful processing as described here (enq->deq->enq). Thoughts?
>>> [Ahmed] Makes sense. I was worried that there might be fundamental
>>> limitations to this mode of use in the API design. That is why I wanted
>>> to share this use mode with you guys and see if it can be accommodated
>>> using an incremental patch in the future.
>>>>>> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one
>>>>>> stream, or how this makes a difference, as there can be many enqueue_burst() calls done before a
>>>>>> dequeue_burst().
>>>>>> Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all
>>>>>> the ops had been processed? This would turn it into a synchronous call, which isn't the intent.
>>>>> [Ahmed] Agreed, a blocking or even a buffering software layer that
>>>>> babysits the hardware does not fundamentally change the parameters of the
>>>>> system as a whole. It just moves workflow management complexity down
>>>>> into the DPDK software layer. Rather, there are real latency and
>>>>> throughput advantages (because of caching) that I want to expose.
>>>>>
>> [Fiona] OK, so I think we've agreed that this can be an option, as long as it is not required of
>> PMDs and is enabled under an explicit capability, named something like
>> ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS.
>> @Ahmed, we'll leave it up to you to define the details.
>> What's necessary is API text to describe the expected behaviour on any error conditions,
>> the pause/resume API, whether an API is expected to clean up if resume doesn't happen,
>> and whether there's any time limit on this, etc.
>> But I wouldn't expect any changes to existing burst APIs, and all PMDs and applications
>> must be able to handle the default behaviour, i.e. with this capability disabled.
>> Specifically, even if a PMD has this capability, if an application ignores it and only sends
>> one op at a time, and the PMD returns OUT_OF_SPACE_RECOVERABLE, the stream should
>> not be in a paused state and the PMD should not wait for a resume() to handle the
>> next op sent for that stream.
>> Does that make sense?
>[Ahmed] That makes sense. When this mode is enabled, additional
>functions must be called to resume the work, even if only one op was in
>flight. When this mode is not enabled, the PMD assumes that the
>caller will never enqueue a stateful op before receiving a response to
>the one that precedes it in a stream.
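A minimal sketch of how an application might gate the proposed mode on the capability. It assumes a hypothetical capability bit named after the ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS flag proposed in this thread; no such flag or bit value comes from a released DPDK header, and the helper names are illustrative only. Without the capability the per-stream limit collapses to one op in flight, i.e. the default enqueue->dequeue->enqueue behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit named after the flag proposed in this
 * thread; the value is illustrative only. */
#define ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS (UINT64_C(1) << 0)

/* Per-stream limit on stateful ops in flight. Without the capability the
 * application must fall back to enqueue -> dequeue -> enqueue, i.e. 1. */
static unsigned stateful_inflight_limit(uint64_t pmd_caps, unsigned wanted)
{
    if (wanted == 0 || !(pmd_caps & ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS))
        return 1;
    return wanted;
}

/* True if one more op of this stream may be enqueued right now. */
static bool stream_may_enqueue(unsigned inflight, unsigned limit)
{
    assert(limit >= 1);
    return inflight < limit;
}
```

In line with the default behaviour described above, an application that always uses a limit of 1 should keep working unchanged on a PMD that advertises the capability.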

[Shally] @ahmed, just to confirm this:

>When this mode is not enabled then the PMD assumes that the caller will never enqueue a stateful op ...

I think what we want to ensure is the reverse of it, i.e. "if the mode is *enabled*, the PMD should still assume that the caller can use the enqueue->dequeue->enqueue sequence for stateful processing, and if on dequeue
he discovers OUT_OF_SPACE_RECOVERABLE and calls enqueue() again to handle it, that should also be supported by the PMD".
In a sense, an application written for a PMD which doesn't have this capability should also work with a PMD which has this capability.
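To make the compatibility requirement concrete, here is a minimal sketch of the default enqueue->dequeue->enqueue flow, including the OUT_OF_SPACE_RECOVERABLE retry. The mock PMD, its one-slot queue, and every type and function name are hypothetical stand-ins, not the proposed rte_comp API, and the "compression" is a plain copy (flush values are carried but ignored by the mock). Because the application never has more than one op of the stream in flight, it should behave the same whether or not a PMD advertises the multi-op capability.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status and flush values, named after terms used in this
 * thread; they do not come from a released DPDK header. */
enum op_status { OP_NOT_PROCESSED, OP_SUCCESS, OP_OUT_OF_SPACE_RECOVERABLE };
enum op_flush { FLUSH_NONE, FLUSH_FULL };

struct comp_op {
    const unsigned char *src;
    size_t src_len;
    unsigned char *dst;
    size_t dst_len;
    size_t consumed;        /* input bytes used by the PMD */
    size_t produced;        /* output bytes written by the PMD */
    enum op_flush flush;    /* ignored by the mock below */
    enum op_status status;
};

/* Mock PMD with one in-flight slot per stream (the default mode).
 * "Compression" is a plain copy to keep the sketch self-contained. */
struct mock_pmd { struct comp_op *inflight; };

static int mock_enqueue(struct mock_pmd *pmd, struct comp_op *op)
{
    if (pmd->inflight != NULL)
        return 0;   /* a second op of the stream is not accepted */
    size_t n = op->src_len < op->dst_len ? op->src_len : op->dst_len;
    memcpy(op->dst, op->src, n);
    op->consumed = n;
    op->produced = n;
    op->status = n < op->src_len ? OP_OUT_OF_SPACE_RECOVERABLE : OP_SUCCESS;
    pmd->inflight = op;
    return 1;
}

static struct comp_op *mock_dequeue(struct mock_pmd *pmd)
{
    struct comp_op *op = pmd->inflight;
    pmd->inflight = NULL;
    return op;
}

/* Application side for one chunk: always dequeue before the next enqueue;
 * on OUT_OF_SPACE_RECOVERABLE resubmit the unconsumed input with fresh
 * output space. `out` must hold the whole chunk; each submission offers
 * at most dst_step bytes. Returns the bytes produced for this chunk. */
static size_t stateful_feed(struct mock_pmd *pmd, const unsigned char *src,
                            size_t len, enum op_flush flush,
                            unsigned char *out, size_t dst_step)
{
    struct comp_op op = { src, len, out, dst_step, 0, 0, flush, OP_NOT_PROCESSED };
    size_t total = 0;

    assert(dst_step > 0);
    for (;;) {
        int accepted = mock_enqueue(pmd, &op);
        assert(accepted == 1);
        (void)accepted;
        struct comp_op *done = mock_dequeue(pmd);
        total += done->produced;
        if (done->status == OP_SUCCESS)
            return total;
        /* Recoverable: skip consumed input, continue after written output. */
        op.src += done->consumed;
        op.src_len -= done->consumed;
        op.dst += done->produced;
        op.status = OP_NOT_PROCESSED;
    }
}
```

Calling stateful_feed() once per chunk, with FLUSH_NONE for intermediate chunks and FLUSH_FULL for the last one, mirrors the enqueue_burst(no_flush) / dequeue_burst / enqueue_burst(full_flush) flow in the RFC text.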

>>
>>>>> /// snip ///
>>>>
>>


* Re: [dpdk-dev] [RFC v2] doc compression API for DPDK
  2018-02-22  4:47                                   ` Verma, Shally
@ 2018-02-22 19:35                                     ` Ahmed Mansour
  0 siblings, 0 replies; 30+ messages in thread
From: Ahmed Mansour @ 2018-02-22 19:35 UTC (permalink / raw)
  To: Verma, Shally, Trahe, Fiona, dev
  Cc: Athreya, Narayana Prasad, Gupta, Ashish, Sahu, Sunila,
	De Lara Guarch, Pablo, Challa, Mahipal, Jain, Deepak K,
	Hemant Agrawal, Roy Pledge, Youri Querry

On 2/21/2018 11:47 PM, Verma, Shally wrote:
>
>> -----Original Message-----
>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>> Sent: 22 February 2018 01:06
>> To: Trahe, Fiona <fiona.trahe@intel.com>; Verma, Shally <Shally.Verma@cavium.com>; dev@dpdk.org
>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish <Ashish.Gupta@cavium.com>; Sahu, Sunila
>> <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Challa, Mahipal
>> <Mahipal.Challa@cavium.com>; Jain, Deepak K <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy
>> Pledge <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>> Subject: Re: [RFC v2] doc compression API for DPDK
>>
>> On 2/21/2018 9:35 AM, Trahe, Fiona wrote:
>>> Hi Ahmed, Shally,
>>>
>>>
>>>> -----Original Message-----
>>>> From: Ahmed Mansour [mailto:ahmed.mansour@nxp.com]
>>>> Sent: Tuesday, February 20, 2018 7:56 PM
>>>> To: Verma, Shally <Shally.Verma@cavium.com>; Trahe, Fiona <fiona.trahe@intel.com>; dev@dpdk.org
>>>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya@cavium.com>; Gupta, Ashish
>>>> <Ashish.Gupta@cavium.com>; Sahu, Sunila <Sunila.Sahu@cavium.com>; De Lara Guarch, Pablo
>>>> <pablo.de.lara.guarch@intel.com>; Challa, Mahipal <Mahipal.Challa@cavium.com>; Jain, Deepak K
>>>> <deepak.k.jain@intel.com>; Hemant Agrawal <hemant.agrawal@nxp.com>; Roy Pledge
>>>> <roy.pledge@nxp.com>; Youri Querry <youri.querry_1@nxp.com>
>>>> Subject: Re: [RFC v2] doc compression API for DPDK
>>>>
>>>> /// snip ///
>>>>>>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>>>>>>>>> It is always an ideal expectation from the application that it should parse
>>>>>>>>>>>>>>>>>>> through all related chunks of source data, making its mbuf-chain, and
>>>>>>>>>>>>>>>>>>> enqueue it for stateless processing.
>>>>>>>>>>>>>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then
>>>>>>>>>>>>>>>>>>> an expected call flow would be something like:
>>>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
>>>>>>>>>>>>>>>>>> burst in a loop until all ops are received. Is this correct?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However, this illustration is
>>>>>>>>>>>>>>>>> specifically in the context of stateful op processing, to reflect that if a stream is broken
>>>>>>>>>>>>>>>>> into chunks, then each chunk should be submitted as one op at a time with type = STATEFUL and
>>>>>>>>>>>>>>>>> needs to be dequeued first before the next chunk is enqueued.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
>>>>>>>>>>>>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just distinguish
>>>>>>>>>>>>>>>>>> the response in exception cases?
>>>>>>>>>>>>>>>>> [Shally] Multiple ops are allowed in flight; however, the condition is that each op in such
>>>>>>>>>>>>>>>>> a case is independent of the others, i.e. they belong to different streams altogether.
>>>>>>>>>>>>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to process all related chunks
>>>>>>>>>>>>>>>>> of data in a single burst by passing them as an ops array, but later found that not so useful
>>>>>>>>>>>>>>>>> for PMD handling, for various reasons. Please refer to the RFC v1 doc review comments for the same.
>>>>>>>>>>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a time, since each
>>>>>>>>>>>>>>>> needs the state of the previous, allowing more than 1 op to be in flight at a time would
>>>>>>>>>>>>>>>> force PMDs to implement internal queueing and exception handling for the OUT_OF_SPACE
>>>>>>>>>>>>>>>> conditions you mention.
>>>>>>>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
>>>>>>>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>>>>>>>>>>>>> complex but doable.
>>>>>>>>>>>>> [Fiona] In my opinion this is not doable; it could be very inefficient.
>>>>>>>>>>>>> There may be many streams.
>>>>>>>>>>>>> The PMD would have to have an internal queue per stream so
>>>>>>>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>>>>>>>>>>>>> And this may ripple back through all subsequent ops in the stream as each
>>>>>>>>>>>>> source len is increased and its dst buffer is not big enough.
>>>>>>>>>>>> [Ahmed] Regarding multi op OUT_OF_SPACE handling.
>>>>>>>>>>>> The caller would still need to adjust the src length/output buffer as you say.
>>>>>>>>>>>> The PMD cannot handle OUT_OF_SPACE internally.
>>>>>>>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream until it gets
>>>>>>>>>>>> explicit confirmation from the caller to continue working on this stream. Any ops received
>>>>>>>>>>>> by the PMD should be returned to the caller with status STREAM_PAUSED since the caller did
>>>>>>>>>>>> not explicitly acknowledge that it has solved the OUT_OF_SPACE issue.
>>>>>>>>>>>> These semantics can be enabled by adding a new function to the API, perhaps stream_resume().
>>>>>>>>>>>> This allows the caller to indicate that it acknowledges that it has seen the issue and this
>>>>>>>>>>>> op should be used to resolve the issue. Implementations that do not support this mode of use
>>>>>>>>>>>> can push back immediately after one op is in flight. Implementations that support this use
>>>>>>>>>>>> mode can allow many ops from the same session.
>>>>>>>>>>>>
>>>>>>>>>>> [Shally] Is it still in the context of having a single burst where all ops belong to one stream? If
>>>>>>>>>>> yes, I would still say it would add an overhead to PMDs, especially if they are expected to work
>>>>>>>>>>> closer to HW (which I think is the case with DPDK PMDs).
>>>>>>>>>>> Though your approach is doable, why can't all this be in a layer above the PMD? i.e. a layer above
>>>>>>>>>>> the PMD can either pass one op at a time with burst size = 1 OR can make a chained mbuf of input
>>>>>>>>>>> and output and pass that as one op.
>>>>>>>>>>> Is it just to ease the chained mbuf burden on applications, or do you see any performance/use-case
>>>>>>>>>>> impacting aspect also?
>>>>>>>>>>>
>>>>>>>>>>> If it is in a context where each op in a burst belongs to a different stream, then why do we need
>>>>>>>>>>> stream_pause and resume? It is an expectation from the app to pass more output buffer with
>>>>>>>>>>> consumed + 1 from the next call onwards, as it has already seen OUT_OF_SPACE.
>>>>>>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
>>>>>>>>> implementation rejects all ops that belong to a stream that has entered
>>>>>>>>> "RECOVERABLE" state for one reason or another. The caller must
>>>>>>>>> acknowledge explicitly that it has received news of the problem before
>>>>>>>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
>>>>>>>>> that implementing this functionality in the software layer above the PMD
>>>>>>>>> is a bad idea since the latency reductions are lost.
>>>>>>>> [Shally] Just reiterating, I rather meant the other way around, i.e. I see it as easier to put all
>>>>>>>> such complexity in a layer above the PMD.
>>>>>>>>
>>>>>>>>> This setup is useful in latency sensitive applications where the latency
>>>>>>>>> of buffering multiple ops into one op is significant. We found latency
>>>>>>>>> makes a significant difference in search applications where the PMD
>>>>>>>>> competes with software decompression.
>>>>>>> [Fiona] I see, so when all goes well, you get best-case latency, but when
>>>>>>> out-of-space occurs latency will probably be worse.
>>>>>> [Ahmed] This is exactly right. This use mode assumes out-of-space is a
>>>>>> rare occurrence. Recovering from it should take similar time to
>>>>>> synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
>>>>>> both sync and async use. The caller can fix up the op and send it back
>>>>>> to the PMD to continue work just as would be done in sync. Nonetheless,
>>>>>> the added complexity is not justifiable if out-of-space is very common
>>>>>> since the recoverable state will be the limiting factor that forces
>>>>>> synchronicity.
>>>>>>>>>> [Fiona] I still have concerns with this and would not want to support it in our PMD.
>>>>>>>>>> To make sure I understand, you want to send a burst of ops, with several from the same stream.
>>>>>>>>>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
>>>>>>>>>> subsequent ops in that stream.
>>>>>>>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
>>>>>>>>>> Or somehow drop them? How?
>>>>>>>>>> While still processing ops from other streams.
>>>>>>>>> [Ahmed] This is exactly correct. It should return them with
>>>>>>>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>>>>>>>>>> As we want to offload each op to hardware with as little CPU processing as possible, we
>>>>>>>>>> would not want to open up each op to see which stream it's attached to and
>>>>>>>>>> make decisions to do per-stream storage, or drop it, or bypass HW and dequeue without processing.
>>>>>>>>> [Ahmed] I think I might have missed your point here, but I will try to
>>>>>>>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
>>>>>>>>> to the PMD and the PMD should reject until stream_continue() is called.
>>>>>>>>> The next op to be sent by the user will have a special marker in it to
>>>>>>>>> inform the PMD to continue working on this stream. Alternatively the
>>>>>>>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
>>>>>>>>> the stream and its state, but like you say this adds additional CPU
>>>>>>>>> overhead during the enqueue.
>>>>>>>>> I am curious: in a simple synchronous use case, how do we prevent users
>>>>>>>>> from putting multiple ops in flight that belong to a single stream? Do
>>>>>>>>> we just currently say it is undefined behavior? Otherwise we would have
>>>>>>>>> to check the stream and incur the CPU overhead.
>>>>>>> [Fiona] We don't do anything to prevent it. It's undefined. IMO, on the data path in the
>>>>>>> DPDK model, we expect good behaviour and don't have to error-check for things like this.
>>>>>> [Ahmed] This makes sense. We also assume good behavior.
>>>>>>> In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then
>>>>>>> build and send those messages. If we found an op from a stream which already
>>>>>>> had one inflight, we'd have to hold that back, store in a sw stream-specific holding queue,
>>>>>>> only send 19 to hw. We cannot send multiple ops from same stream to
>>>>>>> the hw as it fans them out and does them in parallel.
>>>>>>> Once the enqueue_burst() returns, there is no processing
>>>>>>> context which would spot that the first has completed
>>>>>>> and send the next op to the hw. On a dequeue_burst() we would spot this,
>>>>>>> in that context could process the next op in the stream.
>>>>>>> On out of space, instead of processing the next op we would have to transfer
>>>>>>> all unprocessed ops from the stream to the dequeue result.
>>>>>>> Some parts of this are doable, but seems likely to add a lot more latency,
>>>>>>> we'd need to add extra threads and timers to move ops from the sw
>>>>>>> queue to the hw q to get any benefit, and these constructs would add
>>>>>>> context switching and CPU cycles. So we prefer to push this responsibility
>>>>>>> to above the API, where a similar result can be achieved.
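The per-stream holding queue Fiona describes (what a PMD without stream-aware hardware would need if it accepted several ops of one stateful stream) can be sketched as follows. This is a hypothetical single-stream model with a fixed-size FIFO and made-up names, not any vendor's PMD code: the first op goes to the hardware queue, later ops of the same stream are held in software, the next one is released only from dequeue context, and on OUT_OF_SPACE_RECOVERABLE every held op is handed back as NOT_PROCESSED.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HELD 8

enum op_status { OP_NOT_PROCESSED, OP_SUCCESS, OP_OUT_OF_SPACE_RECOVERABLE };

struct op { int id; enum op_status status; };

/* Hypothetical per-stream bookkeeping: at most one op on the HW queue,
 * the rest held back in software in FIFO order. */
struct stream_q {
    struct op *hw;               /* op currently on the HW queue, or NULL */
    struct op *held[MAX_HELD];   /* ops held back in software */
    size_t n_held;
};

/* Enqueue context: send to HW if the stream is idle, otherwise hold it. */
static int sq_enqueue(struct stream_q *s, struct op *o)
{
    if (s->hw == NULL) { s->hw = o; return 1; }
    if (s->n_held == MAX_HELD) return 0;   /* no room; caller retries */
    s->held[s->n_held++] = o;
    return 1;
}

/* Dequeue context: record the HW result for the in-flight op and write
 * finished ops to out[] (capacity 1 + MAX_HELD). Returns how many. */
static size_t sq_complete(struct stream_q *s, enum op_status hw_status,
                          struct op **out)
{
    size_t n = 0;
    struct op *done = s->hw;

    if (done == NULL)
        return 0;
    done->status = hw_status;
    out[n++] = done;
    s->hw = NULL;
    if (hw_status == OP_OUT_OF_SPACE_RECOVERABLE) {
        /* Hand every held op back to the caller unprocessed. */
        for (size_t i = 0; i < s->n_held; i++) {
            s->held[i]->status = OP_NOT_PROCESSED;
            out[n++] = s->held[i];
        }
        s->n_held = 0;
    } else if (s->n_held > 0) {
        /* Only now, when the app polls, can the next op go to HW. */
        s->hw = s->held[0];
        for (size_t i = 1; i < s->n_held; i++)
            s->held[i - 1] = s->held[i];
        s->n_held--;
    }
    return n;
}
```

The absence of any timer or background thread in this model illustrates Fiona's latency point: a held op can only reach the hardware when the application next calls dequeue.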
>>>>>> [Ahmed] I see what you mean. Our workflow is almost exactly the same
>>>>>> with our hardware, but the fanning out is done by the hardware based on
>>>>>> the stream and ops that belong to the same stream are never allowed to
>>>>>> go out of order. Otherwise the data would be corrupted. Likewise the
>>>>>> hardware is responsible for checking the state of the stream and
>>>>>> returning frames as NOT_PROCESSED to the software
>>>>>>>>>> Maybe we could add a capability if this behaviour is important for you?
>>>>>>>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
>>>>>>>>>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
>>>>>>>>>> to be in flight at any time.
>>>>>>>>> [Ahmed] That makes sense. This way the different DPDK implementations do
>>>>>>>>> not have to add extra checking for unsupported cases.
>>>>>>>> [Shally] @ahmed, if I summarise your use case, is this how you want the PMD to support it?
>>>>>>>> - a burst *carries only one stream* and all ops are then assumed to belong to that stream? (please
>>>>>>>> note, here a burst is not carrying more than one stream)
>>>>>> [Ahmed] No. In this use case the caller sets up an op and enqueues a
>>>>>> single op. Then before the response comes back from the PMD the caller
>>>>>> enqueues a second op on the same stream.
>>>>>>>> -PMD will submit one op at a time to HW?
>>>>>> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the
>>>>>> HW. I used DPDK to mean the software implementation that talks to the
>>>>>> hardware.
>>>>>> The software will submit all ops immediately. The hardware has to figure
>>>>>> out what to do with the ops depending on what stream they belong to.
>>>>>>>> -if processed successfully, push it back to completion queue with status = SUCCESS. If failed or run
>>>>>>>> into OUT_OF_SPACE, then push it to completion queue with status = FAILURE/
>>>>>>>> OUT_OF_SPACE_RECOVERABLE and rest with status = NOT_PROCESSED and return with enqueue
>>>>>>>> count = total # of ops submitted originally with burst?
>>>>>> [Ahmed] This is exactly what I had in mind. All ops will be submitted to
>>>>>> the HW. The HW will put all of them on the completion queue with the
>>>>>> correct status exactly as you say.
>>>>>>>> -app assumes all have been enqueued, so it goes and dequeues all ops
>>>>>>>> -on seeing an op with OUT_OF_SPACE_RECOVERABLE, app resubmits a burst of ops with a call to the
>>>>>>>> stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and others as
>>>>>>>> NOT_PROCESSED, with updated input and output buffers?
>>>>>> [Ahmed] Correct, this is what we do today in our proprietary API.
>>>>>>>> -repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If at any time
>>>>>>>> a failure is seen, does the app start the whole processing all over again or just drop this burst?!
>>>>>> [Ahmed] The app has the choice on how to proceed. If the issue is
>>>>>> recoverable then the application can continue this stream from where it
>>>>>> stopped. If the failure is unrecoverable then the application should
>>>>>> first fix the problem and start from the beginning of the stream.
>>>>>>>> If all of the above is true, then I think we should add another API such as rte_comp_enque_single_stream()
>>>>>>>> which will be functional under Feature Flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS or, a better
>>>>>>>> name, SUPPORT_ENQUEUE_SINGLE_STREAM?!
>>>>>> [Ahmed] The main advantage in async use is lost if we force all related
>>>>>> ops to be in the same burst. If we do that, then we might as well merge
>>>>>> all the ops into one op. That would reduce the overhead.
>>>>>> The use mode I am proposing is only useful in cases where the data
>>>>>> becomes available after the first enqueue occurred. I want to allow the
>>>>>> caller to enqueue the second set of data as soon as it is available,
>>>>>> regardless of whether or not the HW has already started working on the
>>>>>> first op in flight.
>>>>> [Shally] @ahmed, OK, it seems I missed a point here. So, please confirm the following:
>>>>>
>>>>> As per the current description in the doc, the expected stateful usage is:
>>>>> enqueue(op1) --> dequeue(op1) --> enqueue(op2)
>>>>>
>>>>> but you're suggesting to allow an option to change it to
>>>>>
>>>>> enqueue(op1) --> enqueue(op2)
>>>>>
>>>>> i.e. multiple ops from the same stream can be put in flight via subsequent enqueue_burst() calls
>>>>> without waiting to dequeue previous ones, as long as the PMD supports it. So, no change to the current
>>>>> definition of a burst: it will still carry multiple streams, with each op belonging to a different stream?!
>>>> [Ahmed] Correct. I guess a user could put two ops on the same burst that
>>>> belong to the same stream. In that case it would be more efficient to
>>>> merge the ops using scatter gather. Nonetheless, I would not add checks
>>>> in my implementation to limit that use. The hardware does not perceive a
>>>> difference between ops that came on one burst and ops that came on two
>>>> different bursts. To the hardware they are all ops. What matters is
>>>> which stream each op belongs to.
>>>>> If yes, then it seems your HW can be set up for multiple streams, so it is efficient for your case to
>>>>> support it in the DPDK PMD layer, but our HW doesn't by default and needs SW to back it. Given that,
>>>>> I also suggest enabling it under some feature flag.
>>>>> However, it looks like an add-on, and if it doesn't change the current definition of a burst and the
>>>>> minimum expectation set on stateful processing described in this document, then IMO you can propose
>>>>> this feature as an incremental patch on the baseline version, in the absence of which the
>>>>> application will exercise stateful processing as described here (enq->deq->enq). Thoughts?
>>>> [Ahmed] Makes sense. I was worried that there might be fundamental
>>>> limitations to this mode of use in the API design. That is why I wanted
>>>> to share this use mode with you guys and see if it can be accommodated
>>>> using an incremental patch in the future.
>>>>>>> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one
>>>>>>> stream, or how this makes a difference, as there can be many enqueue_burst() calls done before a
>>>>>>> dequeue_burst().
>>>>>>> Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all
>>>>>>> the ops had been processed? This would turn it into a synchronous call, which isn't the intent.
>>>>>> [Ahmed] Agreed, a blocking or even a buffering software layer that
>>>>>> babysits the hardware does not fundamentally change the parameters of the
>>>>>> system as a whole. It just moves workflow management complexity down
>>>>>> into the DPDK software layer. Rather, there are real latency and
>>>>>> throughput advantages (because of caching) that I want to expose.
>>>>>>
>>> [Fiona] OK, so I think we've agreed that this can be an option, as long as it is not required of
>>> PMDs and is enabled under an explicit capability, named something like
>>> ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS.
>>> @Ahmed, we'll leave it up to you to define the details.
>>> What's necessary is API text to describe the expected behaviour on any error conditions,
>>> the pause/resume API, whether an API is expected to clean up if resume doesn't happen,
>>> and whether there's any time limit on this, etc.
>>> But I wouldn't expect any changes to existing burst APIs, and all PMDs and applications
>>> must be able to handle the default behaviour, i.e. with this capability disabled.
>>> Specifically, even if a PMD has this capability, if an application ignores it and only sends
>>> one op at a time, and the PMD returns OUT_OF_SPACE_RECOVERABLE, the stream should
>>> not be in a paused state and the PMD should not wait for a resume() to handle the
>>> next op sent for that stream.
>>> Does that make sense?
>> [Ahmed] That makes sense. When this mode is enabled, additional
>> functions must be called to resume the work, even if only one op was in
>> flight. When this mode is not enabled, the PMD assumes that the
>> caller will never enqueue a stateful op before receiving a response to
>> the one that precedes it in a stream.
> [Shally] @ahmed, just to confirm this:
>
>> When this mode is not enabled then the PMD assumes that the caller will never enqueue a stateful op ...
> I think what we want to ensure is the reverse of it, i.e. "if the mode is *enabled*, the PMD should still assume that the caller can use the enqueue->dequeue->enqueue sequence for stateful processing, and if on dequeue
> he discovers OUT_OF_SPACE_RECOVERABLE and calls enqueue() again to handle it, that should also be supported by the PMD".
> In a sense, an application written for a PMD which doesn't have this capability should also work with a PMD which has this capability.
>
[Ahmed] That creates a race condition. Async stateful processing, i.e.
enqueue->enqueue->dequeue, requires the user to explicitly acknowledge
and solve the recoverable op. The PMD cannot assume that any particular
op is the response to a recoverable condition. A lock around enqueue/
dequeue also does not resolve the issue, since the decision to resolve
the issue must be made entirely by the caller, and the timing of that
decision is outside the knowledge of the PMD.
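A minimal sketch of the explicit-acknowledgement semantics described here, assuming a per-stream paused flag and a hypothetical resume marker set by a stream_resume()-style call (the thread proposes such a function but does not define its signature; all names below are illustrative). It models only the mode where the capability is enabled. op2 reaches the PMD after op1 has failed but before the caller has seen that failure, so it is rejected; only the op the caller explicitly marks resumes the stream. Rejected ops get NOT_PROCESSED here; the thread also mentions a STREAM_PAUSED status for the same purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum op_status { OP_NOT_PROCESSED, OP_SUCCESS, OP_OUT_OF_SPACE_RECOVERABLE };
enum stream_state { STREAM_ACTIVE, STREAM_PAUSED };

struct stream { enum stream_state state; };

struct op {
    size_t src_len;     /* bytes to process */
    size_t dst_len;     /* output room offered */
    bool resume;        /* set only by the caller's explicit acknowledgement */
    enum op_status status;
};

/* Hypothetical stand-in for the proposed stream_resume(): the caller marks
 * the op that carries its fix for the recoverable error. */
static void stream_resume(struct op *fix) { fix->resume = true; }

/* What the PMD/HW does with each op of one stream, in submission order. */
static void pmd_process(struct stream *s, struct op *o)
{
    if (s->state == STREAM_PAUSED) {
        if (!o->resume) {                 /* not acknowledged: reject */
            o->status = OP_NOT_PROCESSED;
            return;
        }
        s->state = STREAM_ACTIVE;         /* caller acknowledged the error */
    }
    if (o->dst_len < o->src_len) {       /* toy out-of-space rule */
        o->status = OP_OUT_OF_SPACE_RECOVERABLE;
        s->state = STREAM_PAUSED;
        return;
    }
    o->status = OP_SUCCESS;
}
```

The PMD never decides on its own which later op "fixes" the error; that choice is visible to it only through the caller's explicit marker, which is why a lock around enqueue/dequeue alone cannot replace it.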
>>>>>> /// snip ///
>



end of thread

Thread overview: 30+ messages
2018-01-04 11:45 [dpdk-dev] [RFC v2] doc compression API for DPDK Verma, Shally
2018-01-09 19:07 ` Ahmed Mansour
2018-01-10 12:55   ` Verma, Shally
2018-01-11 18:53     ` Trahe, Fiona
2018-01-12 13:49       ` Verma, Shally
2018-01-25 18:19         ` Ahmed Mansour
2018-01-29 12:47           ` Verma, Shally
2018-01-31 19:03           ` Trahe, Fiona
2018-02-01  5:40             ` Verma, Shally
2018-02-01 11:54               ` Trahe, Fiona
2018-02-01 20:50                 ` Ahmed Mansour
2018-02-14  5:41                   ` Verma, Shally
2018-02-14 16:54                     ` Ahmed Mansour
2018-02-15  5:53                       ` Verma, Shally
2018-02-15 17:20                         ` Trahe, Fiona
2018-02-15 19:51                           ` Ahmed Mansour
2018-02-16 11:11                             ` Trahe, Fiona
2018-02-01 20:23             ` Ahmed Mansour
2018-02-14  7:41               ` Verma, Shally
2018-02-15 18:47                 ` Trahe, Fiona
2018-02-15 21:09                   ` Ahmed Mansour
2018-02-16  7:16                     ` Verma, Shally
2018-02-16 13:04                       ` Trahe, Fiona
2018-02-16 21:21                         ` Ahmed Mansour
2018-02-20  9:58                           ` Verma, Shally
2018-02-20 19:56                             ` Ahmed Mansour
2018-02-21 14:35                               ` Trahe, Fiona
2018-02-21 19:35                                 ` Ahmed Mansour
2018-02-22  4:47                                   ` Verma, Shally
2018-02-22 19:35                                     ` Ahmed Mansour
