DPDK patches and discussions
* [dpdk-dev] Application used for DSW event_dev performance testing
@ 2018-11-14 19:16 Venky Venkatesh
  2018-11-14 19:41 ` Mattias Rönnblom
  0 siblings, 1 reply; 7+ messages in thread
From: Venky Venkatesh @ 2018-11-14 19:16 UTC (permalink / raw)
  To: dev

Hi,

https://mails.dpdk.org/archives/dev/2018-September/111344.html mentions that there is a sample application where “worker cores can sustain 300-400 million event/s. With a pipeline
with 1000 clock cycles of work per stage, the average event device
overhead is somewhere 50-150 clock cycles/event”. Is this sample application code available?

We have written a similar simple sample application where 1 core keeps enqueuing (as NEW/ATOMIC) and n cores dequeue (and RELEASE) and do no other work. But we are not seeing anything close in terms of performance. Also, we are seeing some counterintuitive behaviors, such as a burst of 32 being worse than a burst of 1. We surely have something wrong and would thus like to compare against a good application that you have written. Could you please share it?

Thanks
--Venky



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] Application used for DSW event_dev performance testing
  2018-11-14 19:16 [dpdk-dev] Application used for DSW event_dev performance testing Venky Venkatesh
@ 2018-11-14 19:41 ` Mattias Rönnblom
  2018-11-14 21:56   ` Venky Venkatesh
  0 siblings, 1 reply; 7+ messages in thread
From: Mattias Rönnblom @ 2018-11-14 19:41 UTC (permalink / raw)
  To: Venky Venkatesh, dev

On 2018-11-14 20:16, Venky Venkatesh wrote:
> Hi,
> 
> https://mails.dpdk.org/archives/dev/2018-September/111344.html mentions that there is a sample application where “worker cores can sustain 300-400 million event/s. With a pipeline
> with 1000 clock cycles of work per stage, the average event device
> overhead is somewhere 50-150 clock cycles/event”. Is this sample application code available?
> 
It's proprietary code, although it's also been tested by some of our 
partners.

The primary reason it hasn't been contributed to DPDK is that doing so 
would be a fair amount of work. I would refer to it as an eventdev 
pipeline simulator, rather than a sample app.

> We have written a similar simple sample application where 1 core keeps enqueuing (as NEW/ATOMIC) and n-cores dequeue (and RELEASE) and do no other work. But we are not seeing anything close in terms of performance. Also we are seeing some counter intuitive behaviors such as a burst of 32 is worse than burst of 1. We surely have something wrong and would thus compare against a good application that you have written. Could you pls share it?
> 

Is this enqueue or dequeue burst? How large is n? Is this explicit release?

What do you set nb_events_limit to? Good DSW performance depends heavily 
on the average burst size on the event rings, which in turn depends on 
the number of in-flight events. On really high core-count systems you 
might also want to increase DSW_MAX_PORT_OPS_PER_BG_TASK, since it 
effectively limits the maximum number of events buffered in the 
output buffers.

In the pipeline simulator all cores produce events initially, and then 
recycle events when the number of in-flight events reaches a certain 
threshold (50% of nb_events_limit). A single lcore won't be able to fill 
the pipeline if you have zero-work stages.

Even though I can't send you the simulator code at this point, I'm happy 
to assist you in any DSW-related endeavors.
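
To make the description concrete, here's a minimal sketch of the
warm-up/recycle pattern (the constants and the per-worker quota are
illustrative assumptions, not the simulator's actual code):

#include <rte_eventdev.h>
#include <rte_random.h>

/* Each worker injects its share of the in-flight target (here 50% of
 * nb_events_limit) as NEW events; after that it only forwards what it
 * dequeues, so the number of in-flight events stays roughly constant. */
#define EVENTS_LIMIT (4096)
#define INFLIGHT_TARGET (EVENTS_LIMIT / 2)
#define NUM_FLOWS (1024)

static void
warm_up(uint8_t dev_id, uint8_t port_id, unsigned int num_workers)
{
	uint32_t quota = INFLIGHT_TARGET / num_workers;
	uint32_t produced = 0;

	while (produced < quota) {
		struct rte_event ev = {
			.op = RTE_EVENT_OP_NEW,
			.queue_id = 0,
			.sched_type = RTE_SCHED_TYPE_ATOMIC,
			.flow_id = rte_rand() % NUM_FLOWS
		};

		if (rte_event_enqueue_new_burst(dev_id, port_id, &ev, 1) == 1)
			produced++;
	}
}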

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] Application used for DSW event_dev performance testing
  2018-11-14 19:41 ` Mattias Rönnblom
@ 2018-11-14 21:56   ` Venky Venkatesh
  2018-11-15  5:46     ` Mattias Rönnblom
  0 siblings, 1 reply; 7+ messages in thread
From: Venky Venkatesh @ 2018-11-14 21:56 UTC (permalink / raw)
  To: Mattias Rönnblom, dev

Mattias,
Thanks for the prompt response. Appreciate your situation of not being able to share the proprietary code. More answers inline as [VV]:
--Venky 

On 11/14/18, 11:41 AM, "Mattias Rönnblom" <hofors@lysator.liu.se> wrote:

    On 2018-11-14 20:16, Venky Venkatesh wrote:
    > Hi,
    > 
    > https://mails.dpdk.org/archives/dev/2018-September/111344.html mentions that there is a sample application where “worker cores can sustain 300-400 million event/s. With a pipeline
    > with 1000 clock cycles of work per stage, the average event device
    > overhead is somewhere 50-150 clock cycles/event”. Is this sample application code available?
    > 
    It's proprietary code, although it's also been tested by some of our 
    partners.
    
    The primary reason for it not being contributed to DPDK is because it's 
    a fair amount of work to do so. I would refer to it as an eventdev 
    pipeline simulator, rather than a sample app.
    
    > We have written a similar simple sample application where 1 core keeps enqueuing (as NEW/ATOMIC) and n-cores dequeue (and RELEASE) and do no other work. But we are not seeing anything close in terms of performance. Also we are seeing some counter intuitive behaviors such as a burst of 32 is worse than burst of 1. We surely have something wrong and would thus compare against a good application that you have written. Could you pls share it?
    > 
    
    Is this enqueue or dequeue burst? How large is n? Is this explicit release?
 [VV]: Yes, both enqueue and dequeue use bursts of 32. I tried n=4-7. It is explicit RELEASE.
   
    What do you set nb_events_limit to? Good DSW performance much depends on 
    the average burst size on the event rings, which in turn is dependent on 
    the number of in-flight events. On really high core-count systems you 
    might also want to increase DSW_MAX_PORT_OPS_PER_BG_TASK, since it 
    effectively puts a limit on the maximum number of events buffered on the 
    output buffers.
[VV]:         struct rte_event_dev_config config = {
                        .nb_event_queues = 2,
                        .nb_event_ports = 5,
                        .nb_events_limit  = 4096,
                        .nb_event_queue_flows = 1024,
                        .nb_event_port_dequeue_depth = 128,
                        .nb_event_port_enqueue_depth = 128,
        };
        struct rte_event_port_conf p_conf = {
                        .dequeue_depth = 64,
                        .enqueue_depth = 64,
                        .new_event_threshold = 1024,
                        .disable_implicit_release = 0,
        };
        struct rte_event_queue_conf q_conf = {
                        .schedule_type = RTE_SCHED_TYPE_ATOMIC,
                        .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
                        .nb_atomic_flows = 1024,
                        .nb_atomic_order_sequences = 1024,
        };

    
    In the pipeline simulator all cores produce events initially, and then 
    recycles events when the number of in-flight events reach a certain 
    threshold (50% of nb_events_limit). A single lcore won't be able to fill 
    the pipeline, if you have zero-work stages.
[VV]: I have a single NEW event enqueue thread(0) and a bunch of “dequeue and RELEASE” threads (1-4) – simple case. I have a stats print thread(5) as well. If the 1 enqueue thread is unable to fill the pipeline, what counter would indicate that? I see the contrary effect -- I am tracking the number of times enqueue fails and that number is large.

    
    Even though I can't send you the simulator code at this point, I'm happy 
    to assist you in any DSW-related endeavors.
[VV]: My program is simple enough (nothing proprietary) that I can share it. Can I unicast it to you for a quick recommendation?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] Application used for DSW event_dev performance testing
  2018-11-14 21:56   ` Venky Venkatesh
@ 2018-11-15  5:46     ` Mattias Rönnblom
  2018-11-27 22:33       ` Venky Venkatesh
  0 siblings, 1 reply; 7+ messages in thread
From: Mattias Rönnblom @ 2018-11-15  5:46 UTC (permalink / raw)
  To: Venky Venkatesh, dev

On 2018-11-14 22:56, Venky Venkatesh wrote:
> Mattias,
> Thanks for the prompt response. Appreciate your situation of not being able to share the proprietary code. More answers inline as [VV]:
> --Venky
> 
> On 11/14/18, 11:41 AM, "Mattias Rönnblom" <hofors@lysator.liu.se> wrote:
> 
>      On 2018-11-14 20:16, Venky Venkatesh wrote:
>      > Hi,
>      >
>      > https://mails.dpdk.org/archives/dev/2018-September/111344.html mentions that there is a sample application where “worker cores can sustain 300-400 million event/s. With a pipeline
>      > with 1000 clock cycles of work per stage, the average event device
>      > overhead is somewhere 50-150 clock cycles/event”. Is this sample application code available?
>      >
>      It's proprietary code, although it's also been tested by some of our
>      partners.
>      
>      The primary reason for it not being contributed to DPDK is because it's
>      a fair amount of work to do so. I would refer to it as an eventdev
>      pipeline simulator, rather than a sample app.
>      
>      > We have written a similar simple sample application where 1 core keeps enqueuing (as NEW/ATOMIC) and n-cores dequeue (and RELEASE) and do no other work. But we are not seeing anything close in terms of performance. Also we are seeing some counter intuitive behaviors such as a burst of 32 is worse than burst of 1. We surely have something wrong and would thus compare against a good application that you have written. Could you pls share it?
>      >
>      
>      Is this enqueue or dequeue burst? How large is n? Is this explicit release?
>   [VV]: Yes both are burst of 32. I tried n=4-7. It is explicit RELEASE.
>     

If you want good scheduler throughput, don't do explicit release. With 
other event devices, and heavy-weight pipelines, there might be a point 
to doing so, because the released event's flow could potentially be 
scheduled on other cores. However, on DSW, migration won't happen until 
the application has finished processing its burst, at the earliest.

DSW does buffer on enqueue, so large enqueue bursts don't improve 
performance much. They should not decrease performance unless you go 
above the configured max burst.
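
As a minimal sketch (port/queue ids, burst size and the per-event work 
are placeholders, not taken from your application), a worker stage that 
relies on implicit release simply forwards what it dequeues:

#include <rte_eventdev.h>

#define BURST (32)

static void
worker_iteration(uint8_t dev_id, uint8_t port_id, uint8_t next_queue_id)
{
	struct rte_event evs[BURST];
	uint16_t n, i, sent = 0;

	n = rte_event_dequeue_burst(dev_id, port_id, evs, BURST, 0);

	for (i = 0; i < n; i++) {
		/* ... per-event work goes here ... */
		evs[i].op = RTE_EVENT_OP_FORWARD;
		evs[i].queue_id = next_queue_id;
	}

	/* With implicit release enabled (disable_implicit_release = 0),
	 * forwarding the events (or the next dequeue) hands back the
	 * atomic contexts; no RTE_EVENT_OP_RELEASE events are needed. */
	while (sent < n)
		sent += rte_event_enqueue_burst(dev_id, port_id,
						evs + sent, n - sent);
}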

>      What do you set nb_events_limit to? Good DSW performance much depends on
>      the average burst size on the event rings, which in turn is dependent on
>      the number of in-flight events. On really high core-count systems you
>      might also want to increase DSW_MAX_PORT_OPS_PER_BG_TASK, since it
>      effectively puts a limit on the maximum number of events buffered on the
>      output buffers.
> [VV]:         struct rte_event_dev_config config = {
>                          .nb_event_queues = 2,
>                          .nb_event_ports = 5,
>                          .nb_events_limit  = 4096,
>                          .nb_event_queue_flows = 1024,
>                          .nb_event_port_dequeue_depth = 128,
>                          .nb_event_port_enqueue_depth = 128,
>                          };
>          struct rte_event_port_conf p_conf = {
>                          .dequeue_depth = 64,
>                          .enqueue_depth = 64,
>                          .new_event_threshold = 1024,

"new_event_threshold" effectively puts a limit on the number of inflight 
events. You should increase this to something close to "nb_events_limit".

>                          .disable_implicit_release = 0,
>          };
>          struct rte_event_queue_conf q_conf = {
>                          .schedule_type = RTE_SCHED_TYPE_ATOMIC,
>                          .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
>                          .nb_atomic_flows = 1024,
>                          .nb_atomic_order_sequences = 1024,
>          };
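
E.g., a sketch of the change, keeping the other values from your 
snippet (only new_event_threshold differs):

        struct rte_event_port_conf p_conf = {
                        .dequeue_depth = 64,
                        .enqueue_depth = 64,
                        /* close to nb_events_limit (4096) */
                        .new_event_threshold = 4096,
                        .disable_implicit_release = 0,
        };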

>      
>      In the pipeline simulator all cores produce events initially, and then
>      recycles events when the number of in-flight events reach a certain
>      threshold (50% of nb_events_limit). A single lcore won't be able to fill
>      the pipeline, if you have zero-work stages.
> [VV]: I have a single NEW event enqueue thread(0) and a bunch of “dequeue and RELEASE” threads (1-4) – simple case. I have a stats print thread(5) as well. If the 1 enqueue thread is unable to fill the pipeline, what counter would indicate that? I see the contrary effect -- I am tracking the number of times enqueue fails and that number is large.
> 
>      
There's no counter for failed enqueues, although maybe there should be. 
"dev_credits_on_loan" can be seen as an estimate of how many events are 
currently in flight in the scheduler. If this number is close to your 
"new_event_threshold", the pipeline is busy. If it's low, in the 
couple-of-hundreds range, your pipeline is likely not so busy (or even 
idle), because not enough events are being fed into it.
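
If it helps, a small sketch of reading that figure programmatically 
(assuming the xstat is indeed named "dev_credits_on_loan" in your DSW 
version; rte_event_dev_xstats_names_get() will tell you for sure):

#include <rte_eventdev.h>

static uint64_t
dsw_credits_on_loan(uint8_t dev_id)
{
	unsigned int id;

	/* Returns the xstat's current value (0 if the name is unknown). */
	return rte_event_dev_xstats_by_name_get(dev_id,
						"dev_credits_on_loan", &id);
}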

You can obviously detect failed NEW enqueues in the application as well.

I'm not sure exactly how much one core can produce, and it obviously 
depends on what kind of core, but it's certainly a lot lower than 
"300-400 million events/s". Maybe something like 40-50 Mevents/s.

What is your flow id distribution? As in, how many flow ids are you 
actively using in the events you are feeding to the different 
queues/pipeline stages?

>      Even though I can't send you the simulator code at this point, I'm happy
>      to assist you in any DSW-related endeavors.
> [VV]: My program is a simple enough program (nothing proprietary) that I can share. Can I unicast it to you for a quick recommendation?
> 

Sure, although I prefer to have any discussions on the mailing list, so 
other users can learn from your experiences.

Btw, you really need to get a proper mail user agent, or configure the 
one you have to quote messages as per normal convention.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] Application used for DSW event_dev performance testing
  2018-11-15  5:46     ` Mattias Rönnblom
@ 2018-11-27 22:33       ` Venky Venkatesh
  2018-11-28 16:55         ` Mattias Rönnblom
  0 siblings, 1 reply; 7+ messages in thread
From: Venky Venkatesh @ 2018-11-27 22:33 UTC (permalink / raw)
  To: Mattias Rönnblom, dev


On 11/14/18, 9:46 PM, "Mattias Rönnblom" <mattias.ronnblom@ericsson.com> wrote:
>
>     On 2018-11-14 22:56, Venky Venkatesh wrote:
>     > Mattias,
>     > Thanks for the prompt response. Appreciate your situation of not being able to share the proprietary code. More answers inline as [VV]:
>     > --Venky
>     >
>     > On 11/14/18, 11:41 AM, "Mattias Rönnblom" <hofors@lysator.liu.se> wrote:
>     >
>     >      On 2018-11-14 20:16, Venky Venkatesh wrote:
>     >      > Hi,
>     >      >
>     >      > https://mails.dpdk.org/archives/dev/2018-September/111344.html mentions that there is a sample application where “worker cores can sustain 300-400 million event/s. With a pipeline
>     >      > with 1000 clock cycles of work per stage, the average event device
>     >      > overhead is somewhere 50-150 clock cycles/event”. Is this sample application code available?
>     >      >
>     >      It's proprietary code, although it's also been tested by some of our
>     >      partners.
>     >
>     >      The primary reason for it not being contributed to DPDK is because it's
>     >      a fair amount of work to do so. I would refer to it as an eventdev
>     >      pipeline simulator, rather than a sample app.
>     >
>     >      > We have written a similar simple sample application where 1 core keeps enqueuing (as NEW/ATOMIC) and n-cores dequeue (and RELEASE) and do no other work. But we are not seeing anything close in terms of performance. Also we are seeing some counter intuitive behaviors such as a burst of 32 is worse than burst of 1. We surely have something wrong and would thus compare against a good application that you have written. Could you pls share it?
>     >      >
>     >
>     >      Is this enqueue or dequeue burst? How large is n? Is this explicit release?
>     >   [VV]: Yes both are burst of 32. I tried n=4-7. It is explicit RELEASE.
>     >
>
>     If you want good scheduler throughput, don't do explicit release. With
>     other event devices, and heavy-weight pipelines, there might be a point
>     to do so because the released event's flow could potentially be
>     scheduled on other cores. However, on DSW migration won't happen until
>     the application has finished processing its burst, at the earliest.
>
I am getting ~25M events/sec with a single dequeue core (i.e. n=1) and no additional work after dequeue. I then introduced some work after dequeue and the performance falls steeply to 1M events/sec! Profiling seems to indicate that all the time is spent in our work -- though the work is just the following (where all of these are stack variables):
                for (k = 0; k < n1; k++)
                    for (j=0; j < n2; j ++)
                        temp += access[j];
Profiling also indicates that it is not memory bound. 

The bigger surprise came when I moved to multiple cores (i.e. n > 1). E.g. with n=2-5 I was expecting to do better than the 1M events/sec of a single core. Instead I got 0.53M, 0.32M, 0.20M and 0.13M events/sec respectively. Thus the rate decreases rather steeply as cores are added!
BTW I have removed the explicit RELEASE enqueue. All I do in the worker_loop function is the following:

        int dev_id = 0, i,j, k;
        struct rte_event ev[BURST];
        int access[1024], temp;
        int n_evs = rte_event_dequeue_burst(dev_id, port, ev, BURST, 0);
        for (i = 0; i < n_evs; i++) {
            /* do work for the event, set ev to next eventdev queue */
            switch (ev[i].queue_id) {
            case DEMO_STAGE_TX:
                for (k = 0; k < n1; k++)
                    for (j=0; j < n2; j ++)
                        temp += access[j];
                ev[i].op = RTE_EVENT_OP_RELEASE; // though I don’t call event_enqueue after this.
                break;
            default:
                printf("invalid q_id:%d\n", ev[i].queue_id);
                break;
            }
        }

The amount of work is parameterized by n1 and n2. This is basically accessing up to 4 KB contiguously, over and over. I am using n1=10 and n2=1024.

However, profiling with multiple cores shows that a lot of time is being spent in DSW-related work. Specifically:
dsw_port_transmit_buffered: 25%
dsw_port_flush_out_buffers: 9.8%
dsw_port_ctl_process: 6.3%
dsw_port_consider_migration: 6.3%
dsw_event_dequeue_burst: 6.3%
Real worker: 15%
There are a whole bunch of other DSW functions, each taking about 3%.

As you can see the DSW overhead dominates the scene and very little real work is getting done. Is there some configuration or tuning to be done to get the sort of performance you are seeing with multiple cores?

One consistent observation, however, is that dev_credits_on_loan stays pretty close to the new_event_threshold of 4K.

>     DSW does buffer on enqueue, so large enqueue bursts doesn't improve
>     performance much. They should not decrease performance, unless you go
>     above the configured max burst.
>
>     >      What do you set nb_events_limit to? Good DSW performance much depends on
>     >      the average burst size on the event rings, which in turn is dependent on
>     >      the number of in-flight events. On really high core-count systems you
>     >      might also want to increase DSW_MAX_PORT_OPS_PER_BG_TASK, since it
>     >      effectively puts a limit on the maximum number of events buffered on the
>     >      output buffers.
>     > [VV]:         struct rte_event_dev_config config = {
>     >                          .nb_event_queues = 2,
>     >                          .nb_event_ports = 5,
>     >                          .nb_events_limit  = 4096,
>     >                          .nb_event_queue_flows = 1024,
>     >                          .nb_event_port_dequeue_depth = 128,
>     >                          .nb_event_port_enqueue_depth = 128,
>     >          };
>     >          struct rte_event_port_conf p_conf = {
>     >                          .dequeue_depth = 64,
>     >                          .enqueue_depth = 64,
>     >                          .new_event_threshold = 1024,
>
>     "new_event_threshold" effectively puts a limit on the number of inflight
>     events. You should increase this to something close to "nb_events_limit".
>
>     >                          .disable_implicit_release = 0,
>     >          };
>     >          struct rte_event_queue_conf q_conf = {
>     >                          .schedule_type = RTE_SCHED_TYPE_ATOMIC,
>     >                          .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
>     >                          .nb_atomic_flows = 1024,
>     >                          .nb_atomic_order_sequences = 1024,
>     >          };
>
>     >
>     >      In the pipeline simulator all cores produce events initially, and then
>     >      recycles events when the number of in-flight events reach a certain
>     >      threshold (50% of nb_events_limit). A single lcore won't be able to fill
>     >      the pipeline, if you have zero-work stages.
>     > [VV]: I have a single NEW event enqueue thread(0) and a bunch of “dequeue and RELEASE” threads (1-4) – simple case. I have a stats print thread(5) as well. If the 1 enqueue thread is unable to fill the pipeline, what counter would indicate that? I see the contrary effect -- I am tracking the number of times enqueue fails and that number is large.
>     >
>     >
>     There's no counter for failed enqueues, although maybe there should be.
>     "dev_credits_on_load" can be seen as an estimate of how many events are
>     currently inflight in the scheduler. If this number is close to your
>     "new_event_threshold", the pipeline is busy. If it's low, in the
>     couple-of-hundreds range, your pipeline is likely not-so-busy (even
>     idle) because not enough events are being fed into it.
>
>     You can obviously detect failed NEW enqueues in the application as well.
>
>     I'm not sure exactly how much one core can produce, and it obviously
>     depends on what kind of core, but it's certainly a lot lower than
>     "300-400 millions events/s". Maybe something like 40-50 Mevents/s.
>
>     What is your flow id distribution? As in, how many flow ids are you
>     actively using in the events are you feeding the different
>     queues/pipeline stages.
>

I am using a 2.1 GHz Xeon Silver, 16 cores with hyperthreading (so 32 threads). As indicated above, the performance isn't increasing with cores. I am using 10000 flow ids, generated with rte_rand() % 10000.

>     >      Even though I can't send you the simulator code at this point, I'm happy
>     >      to assist you in any DSW-related endeavors.
>     > [VV]: My program is a simple enough program (nothing proprietary) that I can share. Can I unicast it to you for a quick recommendation?
>     >
>
>     Sure, although I prefer to have any discussions on the mailing list, so
>     other users can learn from your experiences.
>
>     Btw, you really need to get a proper mail user agent, or configure the
>     one you have to quote messages as per normal convention.
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] Application used for DSW event_dev performance testing
  2018-11-27 22:33       ` Venky Venkatesh
@ 2018-11-28 16:55         ` Mattias Rönnblom
  2018-11-28 17:09           ` Mattias Rönnblom
  0 siblings, 1 reply; 7+ messages in thread
From: Mattias Rönnblom @ 2018-11-28 16:55 UTC (permalink / raw)
  To: Venky Venkatesh, dev

On 2018-11-27 23:33, Venky Venkatesh wrote:
> 
> As you can see the DSW overhead dominates the scene and very little real work is getting done. Is there some configuration or tuning to be done to get the sort of performance you are seeing with multiple cores?
>
I can't explain the behavior you are seeing based on the information you 
have supplied.

Attached is a small DSW throughput test program that I thought might 
help you find the issue. It works much like the pipeline simulator I 
used when developing the scheduler, but it's a lot simpler. Remember to 
supply "--vdev=event_dsw0".

I ran it on my 12-core Skylake desktop (@2.9 GHz, turbo disabled). With 
zero work and one stage, I get ~640 Mevent/s. For the first few stages 
you add, you'll see a drop in performance. For example, with 3 stages, 
you are at ~310 Mevent/s.

If you increase DSW_MAX_PORT_OUT_BUFFER and DSW_MAX_PORT_OPS_PER_BG_TASK 
you'll see improvements in efficiency on high-core-count machines. On my 
system, the above goes to 675 M/s for a 1-stage pipeline, and 460 M/s on 
a 3-stage pipeline, if I apply the following changes to dsw_evdev.h:
-#define DSW_MAX_PORT_OUT_BUFFER (32)
+#define DSW_MAX_PORT_OUT_BUFFER (64)

-#define DSW_MAX_PORT_OPS_PER_BG_TASK (128)
+#define DSW_MAX_PORT_OPS_PER_BG_TASK (512)

With 500 clock cycles of dummy work, the per-event overhead is ~16 TSC 
clock cycles/stage and event (i.e. per scheduled event; enqueue + 
dequeue), if my quick-and-dirty benchmark program does the math 
correctly. This also includes the overhead from the benchmark program 
itself.

Overhead with a real application will be higher.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] Application used for DSW event_dev performance testing
  2018-11-28 16:55         ` Mattias Rönnblom
@ 2018-11-28 17:09           ` Mattias Rönnblom
  0 siblings, 0 replies; 7+ messages in thread
From: Mattias Rönnblom @ 2018-11-28 17:09 UTC (permalink / raw)
  To: Venky Venkatesh, dev

On 2018-11-28 17:55, Mattias Rönnblom wrote:
> Attached is a small DSW throughput test program, that I thought might 
> help you to find the issue.

Looks like DPDK's mailman didn't like my attachment.

--

/*
  * dswtp - A simple DSW eventdev scheduler throughput demo program.
  *
  * SPDX-License-Identifier: BSD-3-Clause
  *
  * Copyright(c) 2018 Ericsson AB
  * Mattias Rönnblom <mattias.ronnblom@ericsson.com>
  */

#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#include <rte_atomic.h>
#include <rte_cycles.h>
#include <rte_eal.h>
#include <rte_eventdev.h>
#include <rte_lcore.h>
#include <rte_malloc.h>
#include <rte_pause.h>
#include <rte_random.h>

#define EVENT_DEV_ID (0)
#define NUM_IN_FLIGHT_EVENTS (4096)
#define EVENTDEV_MAX_EVENTS (NUM_IN_FLIGHT_EVENTS * 2)
#define EVENTDEV_PORT_NEW_THRESHOLD (NUM_IN_FLIGHT_EVENTS)
#define NUM_FLOWS (1024)

#define ITER_PER_SYNC (32)

#define DEQUEUE_BURST_SIZE (32)
#define ENQUEUE_BURST_SIZE (32)

struct worker_ctx
{
	uint8_t event_dev_id;
	uint8_t event_port_id;

	uint32_t events_to_produce;

	uint16_t num_stages;
	uint32_t stage_work;
	int64_t num_events;

	rte_atomic64_t *events_finished;
} __rte_cache_aligned;

static void
usage(const char *name)
{
	printf("%s <num-stages> <stage-proc-cycles> <num-events[M]>\n", name);
}

static int64_t
sync_event_count(rte_atomic64_t *total_events_finished,
		 uint32_t *finished_since_sync)
{
	if (*finished_since_sync > 0) {
		int64_t total;

		total = rte_atomic64_add_return(total_events_finished,
						*finished_since_sync);

		*finished_since_sync = 0;

		return total;
	} else
		return rte_atomic64_read(total_events_finished);
}

static void
cycle_consume(uint64_t work)
{
	uint64_t deadline;

	if (likely(work == 0))
	    return;

	deadline = rte_get_timer_cycles() + work;
	while (rte_get_timer_cycles() < deadline)
		rte_pause();
}

static int
worker_start(void *arg)
{
	struct worker_ctx *ctx = arg;
	uint8_t dev_id = ctx->event_dev_id;
	uint8_t port_id = ctx->event_port_id;
	uint32_t num_produced = 0;
	uint32_t finished_since_sync = 0;
	uint16_t iter_since_sync = 0;

	for (;;) {
		uint16_t dequeued;
		uint16_t i;
		uint16_t enqueued = 0;

		if (unlikely(num_produced < ctx->events_to_produce)) {
			struct rte_event ev = {
				.op = RTE_EVENT_OP_NEW,
				.queue_id = 0,
				.sched_type = RTE_SCHED_TYPE_ATOMIC,
				.flow_id = rte_rand() % NUM_FLOWS
			};
			if (rte_event_enqueue_new_burst(dev_id, port_id,
							&ev, 1) == 1)
				num_produced++;
		}

		struct rte_event evs[DEQUEUE_BURST_SIZE];
		dequeued = rte_event_dequeue_burst(dev_id, port_id, evs,
						     DEQUEUE_BURST_SIZE, 0);

		for (i = 0; i < dequeued; i++) {
			struct rte_event *ev = &evs[i];
			uint16_t this_stage = ev->queue_id;
			uint16_t next_stage_num = this_stage + 1;

			cycle_consume(ctx->stage_work);

			ev->op = RTE_EVENT_OP_FORWARD;

			if (next_stage_num == ctx->num_stages) {
				finished_since_sync++;
				ev->queue_id = 0;
			} else
				ev->queue_id = next_stage_num;
		}

		do {
			uint16_t left = dequeued - enqueued;
			uint16_t burst_size =
				RTE_MIN(left, ENQUEUE_BURST_SIZE);
			enqueued +=
				rte_event_enqueue_burst(dev_id, port_id,
							evs+enqueued,
							burst_size);
		} while (unlikely(enqueued != dequeued));

		iter_since_sync++;
		if (unlikely(iter_since_sync == ITER_PER_SYNC)) {
			int64_t total =
				sync_event_count(ctx->events_finished,
						 &finished_since_sync);
			if (total >= ctx->num_events)
				break;
			iter_since_sync = 0;
		}
	}

	return 0;
}

static void
setup_event_dev(uint16_t num_stages, struct worker_ctx *worker_ctxs,
		unsigned num_workers)
{
	unsigned i;
	struct rte_event_dev_info dev_info;

	for (i=0; i < num_workers; i++)
		worker_ctxs[i].event_dev_id = EVENT_DEV_ID;

	rte_event_dev_info_get(EVENT_DEV_ID, &dev_info);

	struct rte_event_dev_config config = {
		.nb_event_queues = num_stages,
		.nb_event_ports = num_workers,
		.nb_events_limit = EVENTDEV_MAX_EVENTS,
		.nb_event_queue_flows = dev_info.max_event_queue_flows,
		.nb_event_port_dequeue_depth = DEQUEUE_BURST_SIZE,
		.nb_event_port_enqueue_depth = ENQUEUE_BURST_SIZE
	};

	int rc = rte_event_dev_configure(EVENT_DEV_ID, &config);
	if (rc)
		rte_panic("Failed to configure the event dev\n");

	struct rte_event_queue_conf queue_config = {
		.priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
	};

	for (i=0; i<num_stages; i++) {
		uint8_t queue_id = i;
		queue_config.schedule_type = RTE_SCHED_TYPE_ATOMIC;
		queue_config.nb_atomic_flows = NUM_FLOWS;
		queue_config.nb_atomic_order_sequences = NUM_FLOWS;

		if (rte_event_queue_setup(EVENT_DEV_ID, queue_id,
                                           &queue_config))
			rte_panic("Unable to setup queue %d\n", queue_id);
	}

	struct rte_event_port_conf port_config = {
		.new_event_threshold = EVENTDEV_PORT_NEW_THRESHOLD,
		.dequeue_depth = DEQUEUE_BURST_SIZE,
		.enqueue_depth = ENQUEUE_BURST_SIZE
	};

	for (i=0; i<num_workers; i++) {
		uint8_t event_port_id = i;
		worker_ctxs[i].event_port_id = event_port_id;
		if (rte_event_port_setup(EVENT_DEV_ID, event_port_id,
					 &port_config) < 0)
			rte_panic("Failed to create worker port #%d\n",
				  event_port_id);
	}

	for (i=0; i<num_workers; i++) {
		uint8_t event_port_id = i;
		if (rte_event_port_link(EVENT_DEV_ID, event_port_id,
					NULL, NULL, 0)
		    != (int)num_stages)
			rte_panic("Failed to map worker ports\n");
	}

	if (rte_event_dev_start(EVENT_DEV_ID))
		rte_panic("Unable to start eventdev\n");
}

static double
tsc_to_s(uint64_t tsc)
{
	return (double)tsc/(double)rte_get_timer_hz();
}

int main(int argc, char *argv[])
{
	int rc;
	unsigned i;
	unsigned num_workers;
	uint16_t num_stages;
	uint32_t stage_work;
	int64_t num_events;
	struct worker_ctx *worker_ctxs;
	rte_atomic64_t *events_finished;
	unsigned lcore_id;
	uint64_t start;
	uint64_t latency;
	uint64_t ideal_latency;

	rc = rte_eal_init(argc, argv);
	if (rc < 0)
		rte_panic("Invalid EAL arguments\n");

	argc -= rc;
	argv += rc;

	if (argc != 4) {
		usage(argv[0]);
		exit(EXIT_FAILURE);
	}

	num_stages = atoi(argv[1]);
	stage_work = atoi(argv[2]);
	num_events = atof(argv[3]) * 1e6;

	num_workers = rte_lcore_count();

	worker_ctxs = rte_malloc("worker-ctx",
				 sizeof(struct worker_ctx) * num_workers,
				 RTE_CACHE_LINE_SIZE);
	events_finished = rte_malloc("finished-events", sizeof(rte_atomic64_t),
				   RTE_CACHE_LINE_SIZE);

	if (worker_ctxs == NULL || events_finished == NULL)
		rte_panic("Unable to allocate memory\n");

	rte_atomic64_init(events_finished);

	for (i=0; i<num_workers; i++) {
		struct worker_ctx *w = &worker_ctxs[i];
		*w = (struct worker_ctx) {
			.event_dev_id = EVENT_DEV_ID,
			.event_port_id = i,
			.events_to_produce = NUM_IN_FLIGHT_EVENTS/num_workers,
			.num_stages = num_stages,
			.stage_work = stage_work,
			.num_events = num_events,
			.events_finished = events_finished
		};
	}

	setup_event_dev(num_stages, worker_ctxs, num_workers);

	start = rte_get_timer_cycles();
	rte_compiler_barrier();

	i = 0;
	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
		if (rte_eal_remote_launch(worker_start, &(worker_ctxs[i]),
					  lcore_id))
			rte_panic("Failed to launch worker");
		i++;
	}

	worker_start(&worker_ctxs[num_workers-1]);

	rte_eal_mp_wait_lcore();

	rte_compiler_barrier();
	latency = rte_get_timer_cycles() - start;
	ideal_latency = (stage_work * num_stages * num_events) / num_workers;

	printf("Workers: %d\n", num_workers);
	printf("Stages: %d\n", num_stages);
	printf("Per-stage application processing: %d TSC cycles\n",
                stage_work);
	printf("Events: %"PRId64" M\n", num_events/1000000);
	if (stage_work > 0)
		printf("Ideal latency: %.2f s\n", tsc_to_s(ideal_latency));
	printf("Actual latency: %.2f s\n", tsc_to_s(latency));

	if (stage_work > 0)
		printf("Ideal scheduling rate: %.2f M events/s\n",
		       (num_events*num_stages)/tsc_to_s(ideal_latency)/1e6);
	printf("Actual scheduling rate: %.2f M events/s\n",
	       (num_events*num_stages)/tsc_to_s(latency)/1e6);

	if (stage_work > 0) {
		uint64_t per_stage_oh =
			(latency - ideal_latency) / (num_events * num_stages);
		printf("Scheduling overhead: %"PRId64" TSC cycles/stage\n",
		       per_stage_oh);
	}

	rte_event_dev_stop(EVENT_DEV_ID);

	rte_exit(0, NULL);
}
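
An illustrative invocation (the core mask is just an example for a 
12-lcore run; the three positional arguments are <num-stages> 
<stage-proc-cycles> <num-events[M]>, as printed by usage() above):

./dswtp -c 0xfff --vdev=event_dsw0 -- 3 500 1000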

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread

Thread overview: 7 messages
2018-11-14 19:16 [dpdk-dev] Application used for DSW event_dev performance testing Venky Venkatesh
2018-11-14 19:41 ` Mattias Rönnblom
2018-11-14 21:56   ` Venky Venkatesh
2018-11-15  5:46     ` Mattias Rönnblom
2018-11-27 22:33       ` Venky Venkatesh
2018-11-28 16:55         ` Mattias Rönnblom
2018-11-28 17:09           ` Mattias Rönnblom
