* [dpdk-dev] DSW eventdev bug?
From: Venky Venkatesh @ 2020-02-28 5:57 UTC
To: dev
Hi,
This is concerning the DSW PMD for eventdev. In our application we added
some debug checks to verify that the same ATOMIC flow isn't scheduled to
two different cores. Strangely enough, we hit that case. Not sure if we
are missing something or if it is a bug.
We put some instrumentation code inside DSW and found the following:
- Time T1: flow1 is on core1
- Time T2: core1 decides to migrate flow1 to core2 (using seen_events
etc.)
- Time T3: core1 completes migration of flow1 to core2
- Time T4: core3 sends flow1 pkts (PKT1) to core2
- Time T5: core1 wants to do another migration and again picks flow1!!!
flow1 is owned by core2 at this time
- Time T6-T7: core1 migrates flow1 to core4. core2 happily acknowledges
core1's attempt to migrate flow1 since it just has to pause, unpause and
flush while PKT1 is still in core2. There is no error check to see if
someone else is migrating my flow :)
- Time T8: core3 sends flow1 pkt (PKT2) -- this time to core4. Now we
have the problem of PKT1 in core2 and PKT2 in core4
I think what went wrong is that core1 was using a stale seen_events array,
which still held entries from the previous migration. During
dsw_port_end_migration we only reset seen_events_len but not
seen_events_idx. So while the latest events (those recorded since the last
migration) are towards the end of port->seen_events (since seen_events_idx
points there), the candidate flow finding may end up looking at indices 0
to seen_events_len, thus using stale information from the past migration.
A simplified sketch of this index/length mismatch follows below.
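The sketch (not the actual DSW driver code; the struct, field types and
buffer size are illustrative assumptions) shows a ring buffer where
recording advances seen_events_idx while candidate-flow selection scans
from index 0 up to seen_events_len:

#include <stdint.h>

#define MAX_RECORDED 128 /* stands in for DSW_MAX_EVENTS_RECORDED */

struct seen_event {
	uint8_t queue_id;
	uint16_t flow_hash;
};

struct port_sketch {
	struct seen_event seen_events[MAX_RECORDED];
	uint16_t seen_events_idx; /* next slot the recorder writes to */
	uint16_t seen_events_len; /* number of entries considered valid */
};

/* Recording keeps writing wherever the index happens to point... */
static void
record_seen_event(struct port_sketch *p, struct seen_event ev)
{
	p->seen_events[p->seen_events_idx] = ev;
	p->seen_events_idx = (p->seen_events_idx + 1) % MAX_RECORDED;
	if (p->seen_events_len < MAX_RECORDED)
		p->seen_events_len++;
}

/* ...while candidate-flow selection scans the array from index 0. */
static uint16_t
seen_event_count_for_flow(const struct port_sketch *p, uint16_t flow_hash)
{
	uint16_t i, n = 0;

	for (i = 0; i < p->seen_events_len; i++)
		/* If the previous migration reset only seen_events_len,
		 * the entries in [0, seen_events_len) visited here can
		 * still be the ones recorded before that migration
		 * (e.g. flow1, which has since moved to another port),
		 * while the fresh recordings sit near seen_events_idx. */
		if (p->seen_events[i].flow_hash == flow_hash)
			n++;
	return n;
}

With seen_events_idx left untouched at the end of a migration, the two
halves of the sketch disagree about where the current events live, which
is exactly the staleness described above.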
Please confirm whether the above findings are correct.
I tried the following and it seems to fix the problem. Please let me know
if this is the right way to fix the issue:
diff -up ./drivers/event/dsw/dsw_event.c.original ./drivers/event/dsw/dsw_event.c
--- ./drivers/event/dsw/dsw_event.c.original	2019-09-05 16:39:52.662976000 -0700
+++ ./drivers/event/dsw/dsw_event.c	2020-02-27 14:49:14.320156000 -0800
@@ -627,6 +641,7 @@ dsw_port_end_migration(struct dsw_evdev
 	port->migration_state = DSW_MIGRATION_STATE_IDLE;
 	port->seen_events_len = 0;
+	port->seen_events_idx = 0;
 	dsw_port_migration_stats(port);
Thanks
-Venky
* Re: [dpdk-dev] DSW eventdev bug?
From: Mattias Rönnblom @ 2020-03-05 10:22 UTC
To: Venky Venkatesh, dpdk-dev
On 2020-02-28 06:57, Venky Venkatesh wrote:
> Hi,
> This is concerning the DSW PMD for eventdev. In our application we added
> some debug checks to verify that the same ATOMIC flow isn't scheduled to
> two different cores. Strangely enough, we hit that case. Not sure if we
> are missing something or if it is a bug.
>
> We put some instrumentation code inside DSW and found the following:
>
> * Time T1: flow1 is on core1
> * Time T2: core1 decides to migrate flow1 to core2 (using
> seen_events etc.)
> * Time T3: core1 completes migration of flow1 to core2
> * Time T4: core3 sends flow1 pkts (PKT1) to core2
> * Time T5: core1 wants to do another migration and again picks
> flow1!!! flow1 is owned by core2 at this time
> * Time T6-T7: core1 migrates flow1 to core4. core2 happily
> acknowledges core1's attempt to migrate flow1 since it just has to
> pause, unpause and flush while PKT1 is still in core2. There is no
> error check to see if someone else is migrating my flow :)
> * Time T8: core3 sends flow1 pkt (PKT2) -- this time to core4. Now
> we have the problem of PKT1 in core2 and PKT2 in core4
>
>
> I think what went wrong is that core1 was using a stale seen_events
> array, which still held entries from the previous migration. During
> dsw_port_end_migration we only reset seen_events_len but not
> seen_events_idx. So while the latest events (those recorded since the
> last migration) are towards the end of port->seen_events (since
> seen_events_idx points there), the candidate flow finding may end up
> looking at indices 0 to seen_events_len, thus using stale information
> from the past migration.
>
This indeed looks like a bug, although I would propose another fix. The
value of the idx doesn't really matter (it's just a circular buffer), as
long as you do not initiate a new migration until you have seen/recorded
enough events. In fact, that's how I thought it worked already, but I
can't find anything in dsw_port_consider_migration() that aborts the
migration in case not enough events have been seen. A rough sketch of
such a check is included below.
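The following fragment (hypothetical and simplified; not the actual
driver code or the eventual patch, though seen_events_len and
DSW_MAX_EVENTS_RECORDED are assumed to mean what they mean in the
driver) illustrates the idea of bailing out early from the
migration-consideration path:

/* Hypothetical fragment meant to sit early in the migration-
 * consideration logic in dsw_event.c; not the actual patch. */
static void
consider_migration_sketch(struct dsw_port *port)
{
	/* If fewer than a full buffer of events has been recorded
	 * since the previous migration ended, the seen_events
	 * statistics may still be dominated by stale entries, so
	 * skip this round and try again on a later call. */
	if (port->seen_events_len < DSW_MAX_EVENTS_RECORDED)
		return;

	/* ...otherwise proceed with candidate-flow selection and the
	 * start of a new migration, as the driver does today... */
}

With such a check in place, the value left in seen_events_idx becomes
harmless: no migration decision is made until the whole ring has been
overwritten with fresh events.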
That said, you must have pretty expensive per-event computation if you
recorded fewer than 128 events in one ms (which is the max DSW migration
rate), and still need to migrate flows away (since the load is high).
That's probably why we haven't triggered the bug.
I'll make a patch. Thanks!