DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
@ 2019-09-03 14:45 Aaron Conole
  2019-09-04  9:41 ` Van Haaren, Harry
  2019-09-04  9:55 ` David Marchand
  0 siblings, 2 replies; 11+ messages in thread
From: Aaron Conole @ 2019-09-03 14:45 UTC (permalink / raw)
  To: Harry van Haaren; +Cc: dev

Hi Harry,

I noticed as part of series_6218
(http://patches.dpdk.org/project/dpdk/list/?series=6218) that the travis
build had a single failure, in service_autotest but it doesn't seem
related to the series at all.

  https://travis-ci.com/ovsrobot/dpdk/jobs/230358460

Not sure if there's some kind of debugging we can add or look at to help
diagnose failures when they occur.  Do you have time to have a look?

-Aaron

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-09-03 14:45 [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing Aaron Conole
@ 2019-09-04  9:41 ` Van Haaren, Harry
  2019-09-04 10:04   ` David Marchand
  2019-09-04  9:55 ` David Marchand
  1 sibling, 1 reply; 11+ messages in thread
From: Van Haaren, Harry @ 2019-09-04  9:41 UTC (permalink / raw)
  To: Aaron Conole; +Cc: dev

> -----Original Message-----
> From: Aaron Conole [mailto:aconole@redhat.com]
> Sent: Tuesday, September 3, 2019 3:46 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org
> Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
> 
> Hi Harry,

Hey Aaron,

> I noticed as part of series_6218
> (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the travis
> build had a single failure, in service_autotest but it doesn't seem
> related to the series at all.
> 
>   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
> 
> Not sure if there's some kind of debugging we can add or look at to help
> diagnose failures when they occur.  Do you have time to have a look?

Thanks for flagging this.

I've just re-run the unit tests here multiple times to see if I can
reproduce something strange, no luck on reproducing the issue.

Attempted with clang-6 and clang-7 (travis error on clang-7),
still no issues found.

Building with Clang-7 and Shared libs (instead of default static)
still no issues found.

If somebody can reproduce please send an update to here and I'll
attempt to replicate that setup. Right now I can't reproduce the issue.

Regards, -Harry

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-09-03 14:45 [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing Aaron Conole
  2019-09-04  9:41 ` Van Haaren, Harry
@ 2019-09-04  9:55 ` David Marchand
  1 sibling, 0 replies; 11+ messages in thread
From: David Marchand @ 2019-09-04  9:55 UTC (permalink / raw)
  To: Aaron Conole; +Cc: Harry van Haaren, dev

On Tue, Sep 3, 2019 at 4:46 PM Aaron Conole <aconole@redhat.com> wrote:
>
> Hi Harry,
>
> I noticed as part of series_6218
> (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the travis
> build had a single failure, in service_autotest but it doesn't seem
> related to the series at all.
>
>   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
>
> Not sure if there's some kind of debugging we can add or look at to help
> diagnose failures when they occur.  Do you have time to have a look?

My two cents.

TEST_ASSERT_EQUAL maps to RTE_TEST_ASSERT which only displays failed
assert at DEBUG level.
I'd say we should change this to ERR level.

I can see you can define a RTE_TEST_TRACE_FAILURE macro too before
including "test.h" (which includes rte_test.h).

Never touched this before though.


-- 
David Marchand

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-09-04  9:41 ` Van Haaren, Harry
@ 2019-09-04 10:04   ` David Marchand
  2019-09-04 10:38     ` David Marchand
  0 siblings, 1 reply; 11+ messages in thread
From: David Marchand @ 2019-09-04 10:04 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: Aaron Conole, dev

On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
>
> > -----Original Message-----
> > From: Aaron Conole [mailto:aconole@redhat.com]
> > Sent: Tuesday, September 3, 2019 3:46 PM
> > To: Van Haaren, Harry <harry.van.haaren@intel.com>
> > Cc: dev@dpdk.org
> > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
> >
> > Hi Harry,
>
> Hey Aaron,
>
> > I noticed as part of series_6218
> > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the travis
> > build had a single failure, in service_autotest but it doesn't seem
> > related to the series at all.
> >
> >   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
> >
> > Not sure if there's some kind of debugging we can add or look at to help
> > diagnose failures when they occur.  Do you have time to have a look?
>
> Thanks for flagging this.
>
> I've just re-run the unit tests here multiple times to see if I can
> reproduce something strange, no luck on reproducing the issue.
>
> Attempted with clang-6 and clang-7 (travis error on clang-7),
> still no issues found.
>
> Building with Clang-7 and Shared libs (instead of default static)
> still no issues found.
>
> If somebody can reproduce please send an update to here and I'll
> attempt to replicate that setup. Right now I can't reproduce the issue.

You have to be patient, but I caught it on my laptop:

# time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
-c 0-1 ./build-gcc-static/app/test/dpdk-test -l 0-1 >$log 2>&1; grep
-q 'Test OK' $log || break; done; cat $log; rm -f $log)
EAL: Detected 8 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No available hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: PCI device 0000:00:1f.6 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 8086:15d7 net_e1000_em
APP: HPET is not enabled, using TSC as default timer
RTE>>service_autotest
 + ------------------------------------------------------- +
 + Test Suite : service core test suite
 + ------------------------------------------------------- +
 + TestCase [ 0] : unregister_all succeeded
 + TestCase [ 1] : service_name succeeded
 + TestCase [ 2] : service_get_by_name succeeded
Service dummy_service Summary
  dummy_service: stats 1    calls 0    cycles 0    avg: 0
Service dummy_service Summary
  dummy_service: stats 0    calls 0    cycles 0    avg: 0
 + TestCase [ 3] : service_dump succeeded
 + TestCase [ 4] : service_attr_get succeeded
 + TestCase [ 5] : service_lcore_attr_get succeeded
 + TestCase [ 6] : service_probe_capability succeeded
 + TestCase [ 7] : service_start_stop succeeded
 + TestCase [ 8] : service_lcore_add_del skipped
 + TestCase [ 9] : service_lcore_start_stop succeeded
 + TestCase [10] : service_lcore_en_dis_able failed
 + TestCase [11] : service_mt_unsafe_poll skipped
 + TestCase [12] : service_mt_safe_poll skipped
 + TestCase [13] : service_app_lcore_mt_safe succeeded
 + TestCase [14] : service_app_lcore_mt_unsafe succeeded
 + TestCase [15] : service_may_be_active succeeded
 + ------------------------------------------------------- +
 + Test Suite Summary
 + Tests Total :       16
 + Tests Skipped :      3
 + Tests Executed :    16
 + Tests Unsupported:   0
 + Tests Passed :      12
 + Tests Failed :       1
 + ------------------------------------------------------- +
Test Failed
RTE>>
real    18m45.959s
user    34m46.441s
sys    0m15.506s


-- 
David Marchand

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-09-04 10:04   ` David Marchand
@ 2019-09-04 10:38     ` David Marchand
  2019-09-04 19:56       ` Aaron Conole
  0 siblings, 1 reply; 11+ messages in thread
From: David Marchand @ 2019-09-04 10:38 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: Aaron Conole, dev

On Wed, Sep 4, 2019 at 12:04 PM David Marchand
<david.marchand@redhat.com> wrote:
>
> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
> <harry.van.haaren@intel.com> wrote:
> >
> > > -----Original Message-----
> > > From: Aaron Conole [mailto:aconole@redhat.com]
> > > Sent: Tuesday, September 3, 2019 3:46 PM
> > > To: Van Haaren, Harry <harry.van.haaren@intel.com>
> > > Cc: dev@dpdk.org
> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
> > >
> > > Hi Harry,
> >
> > Hey Aaron,
> >
> > > I noticed as part of series_6218
> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the travis
> > > build had a single failure, in service_autotest but it doesn't seem
> > > related to the series at all.
> > >
> > >   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
> > >
> > > Not sure if there's some kind of debugging we can add or look at to help
> > > diagnose failures when they occur.  Do you have time to have a look?
> >
> > Thanks for flagging this.
> >
> > I've just re-run the unit tests here multiple times to see if I can
> > reproduce something strange, no luck on reproducing the issue.
> >
> > Attempted with clang-6 and clang-7 (travis error on clang-7),
> > still no issues found.
> >
> > Building with Clang-7 and Shared libs (instead of default static)
> > still no issues found.
> >
> > If somebody can reproduce please send an update to here and I'll
> > attempt to replicate that setup. Right now I can't reproduce the issue.
>
> You have to be patient, but I caught it on my laptop:
>

Ok, and now with the logs:


# time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
-c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1
>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f
$log)
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 1 on socket 0
EAL: Detected lcore 2 as core 2 on socket 0
EAL: Detected lcore 3 as core 3 on socket 0
EAL: Detected lcore 4 as core 0 on socket 0
EAL: Detected lcore 5 as core 1 on socket 0
EAL: Detected lcore 6 as core 2 on socket 0
EAL: Detected lcore 7 as core 3 on socket 0
EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 8 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
dpaa: rte_dpaa_bus_scan():  >>
EAL: DPAA Bus not present. Skipping.
fslmc: fslmc_get_container_group(): DPAA2: DPRC not available
fslmc: rte_fslmc_scan(): FSLMC Bus Not Available. Skipping (-22)
EAL: Module /sys/module/vfio_pci not found! error 2 (No such file or directory)
EAL: VFIO PCI modules not loaded
EAL: Bus dpaa_bus wants IOVA as 'DC'
EAL: Bus fslmc wants IOVA as 'DC'
EAL: Bus pci wants IOVA as 'DC'
EAL: Buses did not request a specific IOVA mode.
EAL: IOMMU is not available, selecting IOVA as PA mode.
EAL: Selected IOVA mode 'PA'
EAL: No available hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
EAL: VFIO modules not loaded, skipping VFIO support...
EAL: Ask a virtual area of 0x2e000 bytes
EAL: Virtual area found at 0x100000000 (size = 0x2e000)
EAL: Setting up physically contiguous memory...
EAL: Setting maximum number of open files to 4096
EAL: Detected memory type: socket_id:0 hugepage_sz:2097152
EAL: Creating 4 segment lists: n_segs:8192 socket_id:0 hugepage_sz:2097152
EAL: Ask a virtual area of 0x61000 bytes
EAL: Virtual area found at 0x10002e000 (size = 0x61000)
EAL: Memseg list allocated: 0x800kB at socket 0
EAL: Ask a virtual area of 0x400000000 bytes
EAL: Virtual area found at 0x100200000 (size = 0x400000000)
EAL: Ask a virtual area of 0x61000 bytes
EAL: Virtual area found at 0x500200000 (size = 0x61000)
EAL: Memseg list allocated: 0x800kB at socket 0
EAL: Ask a virtual area of 0x400000000 bytes
EAL: Virtual area found at 0x500400000 (size = 0x400000000)
EAL: Ask a virtual area of 0x61000 bytes
EAL: Virtual area found at 0x900400000 (size = 0x61000)
EAL: Memseg list allocated: 0x800kB at socket 0
EAL: Ask a virtual area of 0x400000000 bytes
EAL: Virtual area found at 0x900600000 (size = 0x400000000)
EAL: Ask a virtual area of 0x61000 bytes
EAL: Virtual area found at 0xd00600000 (size = 0x61000)
EAL: Memseg list allocated: 0x800kB at socket 0
EAL: Ask a virtual area of 0x400000000 bytes
EAL: Virtual area found at 0xd00800000 (size = 0x400000000)
EAL: TSC frequency is ~2100000 KHz
EAL: Master lcore 0 is ready (tid=7efc3d3a4000;cpuset=[0])
EAL: lcore 1 is ready (tid=7efc39aec700;cpuset=[1])
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: Restoring previous memory policy: 0
EAL: request: mp_malloc_sync
EAL: Heap on socket 0 was expanded by 2MB
EAL: PCI device 0000:00:1f.6 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 8086:15d7 net_e1000_em
EAL:   Not managed by a supported kernel driver, skipped
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
APP: HPET is not enabled, using TSC as default timer
RTE>>service_autotest
 + ------------------------------------------------------- +
 + Test Suite : service core test suite
 + ------------------------------------------------------- +
 + TestCase [ 0] : unregister_all succeeded
 + TestCase [ 1] : service_name succeeded
 + TestCase [ 2] : service_get_by_name succeeded
Service dummy_service Summary
  dummy_service: stats 1    calls 0    cycles 0    avg: 0
Service dummy_service Summary
  dummy_service: stats 0    calls 0    cycles 0    avg: 0
 + TestCase [ 3] : service_dump succeeded
 + TestCase [ 4] : service_attr_get succeeded
 + TestCase [ 5] : service_lcore_attr_get succeeded
 + TestCase [ 6] : service_probe_capability succeeded
 + TestCase [ 7] : service_start_stop succeeded
 + TestCase [ 8] : service_lcore_add_del skipped
 + TestCase [ 9] : service_lcore_start_stop succeeded
EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service
core function call had no effect.
 + TestCase [10] : service_lcore_en_dis_able failed
 + TestCase [11] : service_mt_unsafe_poll skipped
 + TestCase [12] : service_mt_safe_poll skipped
 + TestCase [13] : service_app_lcore_mt_safe succeeded
 + TestCase [14] : service_app_lcore_mt_unsafe succeeded
 + TestCase [15] : service_may_be_active succeeded
 + ------------------------------------------------------- +
 + Test Suite Summary
 + Tests Total :       16
 + Tests Skipped :      3
 + Tests Executed :    16
 + Tests Unsupported:   0
 + Tests Passed :      12
 + Tests Failed :       1
 + ------------------------------------------------------- +
Test Failed
RTE>>EAL: request: mp_malloc_sync
EAL: Heap on socket 0 was shrunk by 2MB

real    2m42.884s
user    5m1.902s
sys    0m2.208s



-- 
David Marchand

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-09-04 10:38     ` David Marchand
@ 2019-09-04 19:56       ` Aaron Conole
  2019-10-07  9:50         ` Van Haaren, Harry
  0 siblings, 1 reply; 11+ messages in thread
From: Aaron Conole @ 2019-09-04 19:56 UTC (permalink / raw)
  To: David Marchand; +Cc: Van Haaren, Harry, dev

David Marchand <david.marchand@redhat.com> writes:

> On Wed, Sep 4, 2019 at 12:04 PM David Marchand
> <david.marchand@redhat.com> wrote:
>>
>> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
>> <harry.van.haaren@intel.com> wrote:
>> >
>> > > -----Original Message-----
>> > > From: Aaron Conole [mailto:aconole@redhat.com]
>> > > Sent: Tuesday, September 3, 2019 3:46 PM
>> > > To: Van Haaren, Harry <harry.van.haaren@intel.com>
>> > > Cc: dev@dpdk.org
>> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
>> > >
>> > > Hi Harry,
>> >
>> > Hey Aaron,
>> >
>> > > I noticed as part of series_6218
>> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the travis
>> > > build had a single failure, in service_autotest but it doesn't seem
>> > > related to the series at all.
>> > >
>> > >   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
>> > >
>> > > Not sure if there's some kind of debugging we can add or look at to help
>> > > diagnose failures when they occur.  Do you have time to have a look?
>> >
>> > Thanks for flagging this.
>> >
>> > I've just re-run the unit tests here multiple times to see if I can
>> > reproduce something strange, no luck on reproducing the issue.
>> >
>> > Attempted with clang-6 and clang-7 (travis error on clang-7),
>> > still no issues found.
>> >
>> > Building with Clang-7 and Shared libs (instead of default static)
>> > still no issues found.
>> >
>> > If somebody can reproduce please send an update to here and I'll
>> > attempt to replicate that setup. Right now I can't reproduce the issue.
>>
>> You have to be patient, but I caught it on my laptop:
>>
>
> Ok, and now with the logs:
>
>
> # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
> -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1
>>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f
> $log)
> EAL: Detected lcore 0 as core 0 on socket 0
> EAL: Detected lcore 1 as core 1 on socket 0
> EAL: Detected lcore 2 as core 2 on socket 0
> EAL: Detected lcore 3 as core 3 on socket 0
> EAL: Detected lcore 4 as core 0 on socket 0
> EAL: Detected lcore 5 as core 1 on socket 0
> EAL: Detected lcore 6 as core 2 on socket 0
> EAL: Detected lcore 7 as core 3 on socket 0
> EAL: Support maximum 128 logical core(s) by configuration.
> EAL: Detected 8 lcore(s)
> EAL: Detected 1 NUMA nodes
> EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
> dpaa: rte_dpaa_bus_scan():  >>
> EAL: DPAA Bus not present. Skipping.
> fslmc: fslmc_get_container_group(): DPAA2: DPRC not available
> fslmc: rte_fslmc_scan(): FSLMC Bus Not Available. Skipping (-22)
> EAL: Module /sys/module/vfio_pci not found! error 2 (No such file or directory)
> EAL: VFIO PCI modules not loaded
> EAL: Bus dpaa_bus wants IOVA as 'DC'
> EAL: Bus fslmc wants IOVA as 'DC'
> EAL: Bus pci wants IOVA as 'DC'
> EAL: Buses did not request a specific IOVA mode.
> EAL: IOMMU is not available, selecting IOVA as PA mode.
> EAL: Selected IOVA mode 'PA'
> EAL: No available hugepages reported in hugepages-1048576kB
> EAL: Probing VFIO support...
> EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
> EAL: VFIO modules not loaded, skipping VFIO support...
> EAL: Ask a virtual area of 0x2e000 bytes
> EAL: Virtual area found at 0x100000000 (size = 0x2e000)
> EAL: Setting up physically contiguous memory...
> EAL: Setting maximum number of open files to 4096
> EAL: Detected memory type: socket_id:0 hugepage_sz:2097152
> EAL: Creating 4 segment lists: n_segs:8192 socket_id:0 hugepage_sz:2097152
> EAL: Ask a virtual area of 0x61000 bytes
> EAL: Virtual area found at 0x10002e000 (size = 0x61000)
> EAL: Memseg list allocated: 0x800kB at socket 0
> EAL: Ask a virtual area of 0x400000000 bytes
> EAL: Virtual area found at 0x100200000 (size = 0x400000000)
> EAL: Ask a virtual area of 0x61000 bytes
> EAL: Virtual area found at 0x500200000 (size = 0x61000)
> EAL: Memseg list allocated: 0x800kB at socket 0
> EAL: Ask a virtual area of 0x400000000 bytes
> EAL: Virtual area found at 0x500400000 (size = 0x400000000)
> EAL: Ask a virtual area of 0x61000 bytes
> EAL: Virtual area found at 0x900400000 (size = 0x61000)
> EAL: Memseg list allocated: 0x800kB at socket 0
> EAL: Ask a virtual area of 0x400000000 bytes
> EAL: Virtual area found at 0x900600000 (size = 0x400000000)
> EAL: Ask a virtual area of 0x61000 bytes
> EAL: Virtual area found at 0xd00600000 (size = 0x61000)
> EAL: Memseg list allocated: 0x800kB at socket 0
> EAL: Ask a virtual area of 0x400000000 bytes
> EAL: Virtual area found at 0xd00800000 (size = 0x400000000)
> EAL: TSC frequency is ~2100000 KHz
> EAL: Master lcore 0 is ready (tid=7efc3d3a4000;cpuset=[0])
> EAL: lcore 1 is ready (tid=7efc39aec700;cpuset=[1])
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: request: mp_malloc_sync
> EAL: Heap on socket 0 was expanded by 2MB
> EAL: PCI device 0000:00:1f.6 on NUMA socket -1
> EAL:   Invalid NUMA socket, default to 0
> EAL:   probe driver: 8086:15d7 net_e1000_em
> EAL:   Not managed by a supported kernel driver, skipped
> EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
> APP: HPET is not enabled, using TSC as default timer
> RTE>>service_autotest
>  + ------------------------------------------------------- +
>  + Test Suite : service core test suite
>  + ------------------------------------------------------- +
>  + TestCase [ 0] : unregister_all succeeded
>  + TestCase [ 1] : service_name succeeded
>  + TestCase [ 2] : service_get_by_name succeeded
> Service dummy_service Summary
>   dummy_service: stats 1    calls 0    cycles 0    avg: 0
> Service dummy_service Summary
>   dummy_service: stats 0    calls 0    cycles 0    avg: 0
>  + TestCase [ 3] : service_dump succeeded
>  + TestCase [ 4] : service_attr_get succeeded
>  + TestCase [ 5] : service_lcore_attr_get succeeded
>  + TestCase [ 6] : service_probe_capability succeeded
>  + TestCase [ 7] : service_start_stop succeeded
>  + TestCase [ 8] : service_lcore_add_del skipped
>  + TestCase [ 9] : service_lcore_start_stop succeeded
> EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service
> core function call had no effect.
>  + TestCase [10] : service_lcore_en_dis_able failed
>  + TestCase [11] : service_mt_unsafe_poll skipped
>  + TestCase [12] : service_mt_safe_poll skipped
>  + TestCase [13] : service_app_lcore_mt_safe succeeded
>  + TestCase [14] : service_app_lcore_mt_unsafe succeeded
>  + TestCase [15] : service_may_be_active succeeded
>  + ------------------------------------------------------- +
>  + Test Suite Summary
>  + Tests Total :       16
>  + Tests Skipped :      3
>  + Tests Executed :    16
>  + Tests Unsupported:   0
>  + Tests Passed :      12
>  + Tests Failed :       1
>  + ------------------------------------------------------- +
> Test Failed
> RTE>>EAL: request: mp_malloc_sync
> EAL: Heap on socket 0 was shrunk by 2MB
>
> real    2m42.884s
> user    5m1.902s
> sys    0m2.208s

I can confirm - takes about 1m to fail.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-09-04 19:56       ` Aaron Conole
@ 2019-10-07  9:50         ` Van Haaren, Harry
  2019-10-07 12:38           ` Aaron Conole
  0 siblings, 1 reply; 11+ messages in thread
From: Van Haaren, Harry @ 2019-10-07  9:50 UTC (permalink / raw)
  To: Aaron Conole, David Marchand; +Cc: dev

> -----Original Message-----
> From: Aaron Conole [mailto:aconole@redhat.com]
> Sent: Wednesday, September 4, 2019 8:56 PM
> To: David Marchand <david.marchand@redhat.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest
> failing
> 
> David Marchand <david.marchand@redhat.com> writes:
> 
> > On Wed, Sep 4, 2019 at 12:04 PM David Marchand
> > <david.marchand@redhat.com> wrote:
> >>
> >> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
> >> <harry.van.haaren@intel.com> wrote:
> >> >
> >> > > -----Original Message-----
> >> > > From: Aaron Conole [mailto:aconole@redhat.com]
> >> > > Sent: Tuesday, September 3, 2019 3:46 PM
> >> > > To: Van Haaren, Harry <harry.van.haaren@intel.com>
> >> > > Cc: dev@dpdk.org
> >> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
> >> > >
> >> > > Hi Harry,
> >> >
> >> > Hey Aaron,
> >> >
> >> > > I noticed as part of series_6218
> >> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the
> travis
> >> > > build had a single failure, in service_autotest but it doesn't seem
> >> > > related to the series at all.
> >> > >
> >> > >   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
> >> > >
> >> > > Not sure if there's some kind of debugging we can add or look at to
> help
> >> > > diagnose failures when they occur.  Do you have time to have a look?
> >> >
> >> > Thanks for flagging this.
> >> >
> >> > I've just re-run the unit tests here multiple times to see if I can
> >> > reproduce something strange, no luck on reproducing the issue.
> >> >
> >> > Attempted with clang-6 and clang-7 (travis error on clang-7),
> >> > still no issues found.
> >> >
> >> > Building with Clang-7 and Shared libs (instead of default static)
> >> > still no issues found.
> >> >
> >> > If somebody can reproduce please send an update to here and I'll
> >> > attempt to replicate that setup. Right now I can't reproduce the issue.
> >>
> >> You have to be patient, but I caught it on my laptop:
> >>
> >
> > Ok, and now with the logs:
> >
> >
> > # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
> > -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1
> >>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f
> > $log)
> > EAL: Detected lcore 0 as core 0 on socket 0
> > EAL: Detected lcore 1 as core 1 on socket 0
> > EAL: Detected lcore 2 as core 2 on socket 0
<snip verbose logs>
> > EAL:   probe driver: 8086:15d7 net_e1000_em
> > EAL:   Not managed by a supported kernel driver, skipped
> > EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
> > APP: HPET is not enabled, using TSC as default timer
> > RTE>>service_autotest
> >  + ------------------------------------------------------- +
> >  + Test Suite : service core test suite
> >  + ------------------------------------------------------- +
> >  + TestCase [ 0] : unregister_all succeeded
> >  + TestCase [ 1] : service_name succeeded
> >  + TestCase [ 2] : service_get_by_name succeeded
> > Service dummy_service Summary
> >   dummy_service: stats 1    calls 0    cycles 0    avg: 0
> > Service dummy_service Summary
> >   dummy_service: stats 0    calls 0    cycles 0    avg: 0
> >  + TestCase [ 3] : service_dump succeeded
> >  + TestCase [ 4] : service_attr_get succeeded
> >  + TestCase [ 5] : service_lcore_attr_get succeeded
> >  + TestCase [ 6] : service_probe_capability succeeded
> >  + TestCase [ 7] : service_start_stop succeeded
> >  + TestCase [ 8] : service_lcore_add_del skipped
> >  + TestCase [ 9] : service_lcore_start_stop succeeded
> > EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service
> > core function call had no effect.
> >  + TestCase [10] : service_lcore_en_dis_able failed
> >  + TestCase [11] : service_mt_unsafe_poll skipped
> >  + TestCase [12] : service_mt_safe_poll skipped
> >  + TestCase [13] : service_app_lcore_mt_safe succeeded
> >  + TestCase [14] : service_app_lcore_mt_unsafe succeeded
> >  + TestCase [15] : service_may_be_active succeeded
> >  + ------------------------------------------------------- +
> >  + Test Suite Summary
> >  + Tests Total :       16
> >  + Tests Skipped :      3
> >  + Tests Executed :    16
> >  + Tests Unsupported:   0
> >  + Tests Passed :      12
> >  + Tests Failed :       1
> >  + ------------------------------------------------------- +
> > Test Failed
> > RTE>>EAL: request: mp_malloc_sync
> > EAL: Heap on socket 0 was shrunk by 2MB
> >
> > real    2m42.884s
> > user    5m1.902s
> > sys    0m2.208s
> 
> I can confirm - takes about 1m to fail.


Hi Aaron and David,

I've been attempting to reproduce this, still no errors here.

Given the nature of service-cores, and the difficulty to reproduce
here this feels like a race-condition - one that may not exist in all
binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).

I'm using Meson to build, so reproducing using this instead of the command
as provided above. There should be no difference in reproducing due to this:

$ meson test service_autotest --repeat 50

1/1 DPDK:fast-tests / service_autotest      OK       3.86 s
1/1 DPDK:fast-tests / service_autotest      OK       3.87 s
...
1/1 DPDK:fast-tests / service_autotest      OK       3.84 s

OK:        50
FAIL:       0
SKIP:       0
TIMEOUT:    0

I'll keep it running for a few hours but I have little faith if it only
takes 1 minute on your machines...

Regards, -Harry

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-10-07  9:50         ` Van Haaren, Harry
@ 2019-10-07 12:38           ` Aaron Conole
  2019-10-14 14:53             ` Aaron Conole
  0 siblings, 1 reply; 11+ messages in thread
From: Aaron Conole @ 2019-10-07 12:38 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: David Marchand, dev

"Van Haaren, Harry" <harry.van.haaren@intel.com> writes:

>> -----Original Message-----
>> From: Aaron Conole [mailto:aconole@redhat.com]
>> Sent: Wednesday, September 4, 2019 8:56 PM
>> To: David Marchand <david.marchand@redhat.com>
>> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest
>> failing
>> 
>> David Marchand <david.marchand@redhat.com> writes:
>> 
>> > On Wed, Sep 4, 2019 at 12:04 PM David Marchand
>> > <david.marchand@redhat.com> wrote:
>> >>
>> >> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
>> >> <harry.van.haaren@intel.com> wrote:
>> >> >
>> >> > > -----Original Message-----
>> >> > > From: Aaron Conole [mailto:aconole@redhat.com]
>> >> > > Sent: Tuesday, September 3, 2019 3:46 PM
>> >> > > To: Van Haaren, Harry <harry.van.haaren@intel.com>
>> >> > > Cc: dev@dpdk.org
>> >> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
>> >> > >
>> >> > > Hi Harry,
>> >> >
>> >> > Hey Aaron,
>> >> >
>> >> > > I noticed as part of series_6218
>> >> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the
>> travis
>> >> > > build had a single failure, in service_autotest but it doesn't seem
>> >> > > related to the series at all.
>> >> > >
>> >> > >   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
>> >> > >
>> >> > > Not sure if there's some kind of debugging we can add or look at to
>> help
>> >> > > diagnose failures when they occur.  Do you have time to have a look?
>> >> >
>> >> > Thanks for flagging this.
>> >> >
>> >> > I've just re-run the unit tests here multiple times to see if I can
>> >> > reproduce something strange, no luck on reproducing the issue.
>> >> >
>> >> > Attempted with clang-6 and clang-7 (travis error on clang-7),
>> >> > still no issues found.
>> >> >
>> >> > Building with Clang-7 and Shared libs (instead of default static)
>> >> > still no issues found.
>> >> >
>> >> > If somebody can reproduce please send an update to here and I'll
>> >> > attempt to replicate that setup. Right now I can't reproduce the issue.
>> >>
>> >> You have to be patient, but I caught it on my laptop:
>> >>
>> >
>> > Ok, and now with the logs:
>> >
>> >
>> > # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
>> > -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1
>> >>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f
>> > $log)
>> > EAL: Detected lcore 0 as core 0 on socket 0
>> > EAL: Detected lcore 1 as core 1 on socket 0
>> > EAL: Detected lcore 2 as core 2 on socket 0
> <snip verbose logs>
>> > EAL:   probe driver: 8086:15d7 net_e1000_em
>> > EAL:   Not managed by a supported kernel driver, skipped
>> > EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
>> > APP: HPET is not enabled, using TSC as default timer
>> > RTE>>service_autotest
>> >  + ------------------------------------------------------- +
>> >  + Test Suite : service core test suite
>> >  + ------------------------------------------------------- +
>> >  + TestCase [ 0] : unregister_all succeeded
>> >  + TestCase [ 1] : service_name succeeded
>> >  + TestCase [ 2] : service_get_by_name succeeded
>> > Service dummy_service Summary
>> >   dummy_service: stats 1    calls 0    cycles 0    avg: 0
>> > Service dummy_service Summary
>> >   dummy_service: stats 0    calls 0    cycles 0    avg: 0
>> >  + TestCase [ 3] : service_dump succeeded
>> >  + TestCase [ 4] : service_attr_get succeeded
>> >  + TestCase [ 5] : service_lcore_attr_get succeeded
>> >  + TestCase [ 6] : service_probe_capability succeeded
>> >  + TestCase [ 7] : service_start_stop succeeded
>> >  + TestCase [ 8] : service_lcore_add_del skipped
>> >  + TestCase [ 9] : service_lcore_start_stop succeeded
>> > EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service
>> > core function call had no effect.
>> >  + TestCase [10] : service_lcore_en_dis_able failed
>> >  + TestCase [11] : service_mt_unsafe_poll skipped
>> >  + TestCase [12] : service_mt_safe_poll skipped
>> >  + TestCase [13] : service_app_lcore_mt_safe succeeded
>> >  + TestCase [14] : service_app_lcore_mt_unsafe succeeded
>> >  + TestCase [15] : service_may_be_active succeeded
>> >  + ------------------------------------------------------- +
>> >  + Test Suite Summary
>> >  + Tests Total :       16
>> >  + Tests Skipped :      3
>> >  + Tests Executed :    16
>> >  + Tests Unsupported:   0
>> >  + Tests Passed :      12
>> >  + Tests Failed :       1
>> >  + ------------------------------------------------------- +
>> > Test Failed
>> > RTE>>EAL: request: mp_malloc_sync
>> > EAL: Heap on socket 0 was shrunk by 2MB
>> >
>> > real    2m42.884s
>> > user    5m1.902s
>> > sys    0m2.208s
>> 
>> I can confirm - takes about 1m to fail.
>
>
> Hi Aaron and David,
>
> I've been attempting to reproduce this, still no errors here.
>
> Given the nature of service-cores, and the difficulty to reproduce
> here this feels like a race-condition - one that may not exist in all
> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
>
> I'm using Meson to build, so reproducing using this instead of the command
> as provided above. There should be no difference in reproducing due to this:

The command runs far more iterations than meson does (I think).

I still see it periodically occur in the travis environment.

I did see at least one missing memory barrier (I believe).  Please
review the following code change (and if you agree I can submit it
formally):

-----
--- a/lib/librte_eal/common/eal_common_launch.c
+++ b/lib/librte_eal/common/eal_common_launch.c
@@ -21,8 +21,10 @@
 int
 rte_eal_wait_lcore(unsigned slave_id)
 {
-       if (lcore_config[slave_id].state == WAIT)
+       if (lcore_config[slave_id].state == WAIT) {
+               rte_rmb();
                return 0;
+       }
 
        while (lcore_config[slave_id].state != WAIT &&
               lcore_config[slave_id].state != FINISHED)
-----

This is because in lib/librte_eal/linux/eal/eal_thread.c:

-----
		/* when a service core returns, it should go directly to WAIT
		 * state, because the application will not lcore_wait() for it.
		 */
		if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
			lcore_config[lcore_id].state = WAIT;
		else
			lcore_config[lcore_id].state = FINISHED;
-----

NOTE that the service core skips the rte_eal_wait_lcore() code from
making the FINISHED->WAIT transition.  So I think at least that read
barrier will be needed (maybe I miss the pairing, though?).

Additionally, I'm wondering if there is an additional write or sync
barrier needed to ensure that some of the transitions are properly
recorded when using lcore as a service lcore function.  The fact that
this only happens occasionally tells me that it's either a race (which
is possible... because the variable update in the test might not be
sync'd across cores or something), or some other missing
synchronization.

> $ meson test service_autotest --repeat 50
>
> 1/1 DPDK:fast-tests / service_autotest      OK       3.86 s
> 1/1 DPDK:fast-tests / service_autotest      OK       3.87 s
> ...
> 1/1 DPDK:fast-tests / service_autotest      OK       3.84 s
>
> OK:        50
> FAIL:       0
> SKIP:       0
> TIMEOUT:    0
>
> I'll keep it running for a few hours but I have little faith if it only
> takes 1 minute on your machines...

Please try the flat command.

> Regards, -Harry

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-10-07 12:38           ` Aaron Conole
@ 2019-10-14 14:53             ` Aaron Conole
  2019-10-14 16:48               ` Van Haaren, Harry
  0 siblings, 1 reply; 11+ messages in thread
From: Aaron Conole @ 2019-10-14 14:53 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: David Marchand, dev

Aaron Conole <aconole@redhat.com> writes:

> "Van Haaren, Harry" <harry.van.haaren@intel.com> writes:
>
>>> -----Original Message-----
>>> From: Aaron Conole [mailto:aconole@redhat.com]
>>> Sent: Wednesday, September 4, 2019 8:56 PM
>>> To: David Marchand <david.marchand@redhat.com>
>>> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
>>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest
>>> failing
>>> 
>>> David Marchand <david.marchand@redhat.com> writes:
>>> 
>>> > On Wed, Sep 4, 2019 at 12:04 PM David Marchand
>>> > <david.marchand@redhat.com> wrote:
>>> >>
>>> >> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
>>> >> <harry.van.haaren@intel.com> wrote:
>>> >> >
>>> >> > > -----Original Message-----
>>> >> > > From: Aaron Conole [mailto:aconole@redhat.com]
>>> >> > > Sent: Tuesday, September 3, 2019 3:46 PM
>>> >> > > To: Van Haaren, Harry <harry.van.haaren@intel.com>
>>> >> > > Cc: dev@dpdk.org
>>> >> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
>>> >> > >
>>> >> > > Hi Harry,
>>> >> >
>>> >> > Hey Aaron,
>>> >> >
>>> >> > > I noticed as part of series_6218
>>> >> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the
>>> travis
>>> >> > > build had a single failure, in service_autotest but it doesn't seem
>>> >> > > related to the series at all.
>>> >> > >
>>> >> > >   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
>>> >> > >
>>> >> > > Not sure if there's some kind of debugging we can add or look at to
>>> help
>>> >> > > diagnose failures when they occur.  Do you have time to have a look?
>>> >> >
>>> >> > Thanks for flagging this.
>>> >> >
>>> >> > I've just re-run the unit tests here multiple times to see if I can
>>> >> > reproduce something strange, no luck on reproducing the issue.
>>> >> >
>>> >> > Attempted with clang-6 and clang-7 (travis error on clang-7),
>>> >> > still no issues found.
>>> >> >
>>> >> > Building with Clang-7 and Shared libs (instead of default static)
>>> >> > still no issues found.
>>> >> >
>>> >> > If somebody can reproduce please send an update to here and I'll
>>> >> > attempt to replicate that setup. Right now I can't reproduce the issue.
>>> >>
>>> >> You have to be patient, but I caught it on my laptop:
>>> >>
>>> >
>>> > Ok, and now with the logs:
>>> >
>>> >
>>> > # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
>>> > -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1
>>> >>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f
>>> > $log)
>>> > EAL: Detected lcore 0 as core 0 on socket 0
>>> > EAL: Detected lcore 1 as core 1 on socket 0
>>> > EAL: Detected lcore 2 as core 2 on socket 0
>> <snip verbose logs>
>>> > EAL:   probe driver: 8086:15d7 net_e1000_em
>>> > EAL:   Not managed by a supported kernel driver, skipped
>>> > EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
>>> > APP: HPET is not enabled, using TSC as default timer
>>> > RTE>>service_autotest
>>> >  + ------------------------------------------------------- +
>>> >  + Test Suite : service core test suite
>>> >  + ------------------------------------------------------- +
>>> >  + TestCase [ 0] : unregister_all succeeded
>>> >  + TestCase [ 1] : service_name succeeded
>>> >  + TestCase [ 2] : service_get_by_name succeeded
>>> > Service dummy_service Summary
>>> >   dummy_service: stats 1    calls 0    cycles 0    avg: 0
>>> > Service dummy_service Summary
>>> >   dummy_service: stats 0    calls 0    cycles 0    avg: 0
>>> >  + TestCase [ 3] : service_dump succeeded
>>> >  + TestCase [ 4] : service_attr_get succeeded
>>> >  + TestCase [ 5] : service_lcore_attr_get succeeded
>>> >  + TestCase [ 6] : service_probe_capability succeeded
>>> >  + TestCase [ 7] : service_start_stop succeeded
>>> >  + TestCase [ 8] : service_lcore_add_del skipped
>>> >  + TestCase [ 9] : service_lcore_start_stop succeeded
>>> > EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service
>>> > core function call had no effect.
>>> >  + TestCase [10] : service_lcore_en_dis_able failed
>>> >  + TestCase [11] : service_mt_unsafe_poll skipped
>>> >  + TestCase [12] : service_mt_safe_poll skipped
>>> >  + TestCase [13] : service_app_lcore_mt_safe succeeded
>>> >  + TestCase [14] : service_app_lcore_mt_unsafe succeeded
>>> >  + TestCase [15] : service_may_be_active succeeded
>>> >  + ------------------------------------------------------- +
>>> >  + Test Suite Summary
>>> >  + Tests Total :       16
>>> >  + Tests Skipped :      3
>>> >  + Tests Executed :    16
>>> >  + Tests Unsupported:   0
>>> >  + Tests Passed :      12
>>> >  + Tests Failed :       1
>>> >  + ------------------------------------------------------- +
>>> > Test Failed
>>> > RTE>>EAL: request: mp_malloc_sync
>>> > EAL: Heap on socket 0 was shrunk by 2MB
>>> >
>>> > real    2m42.884s
>>> > user    5m1.902s
>>> > sys    0m2.208s
>>> 
>>> I can confirm - takes about 1m to fail.
>>
>>
>> Hi Aaron and David,
>>
>> I've been attempting to reproduce this, still no errors here.
>>
>> Given the nature of service-cores, and the difficulty to reproduce
>> here this feels like a race-condition - one that may not exist in all
>> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
>>
>> I'm using Meson to build, so reproducing using this instead of the command
>> as provided above. There should be no difference in reproducing due to this:
>
> The command runs far more iterations than meson does (I think).
>
> I still see it periodically occur in the travis environment.
>
> I did see at least one missing memory barrier (I believe).  Please
> review the following code change (and if you agree I can submit it
> formally):
>
> -----
> --- a/lib/librte_eal/common/eal_common_launch.c
> +++ b/lib/librte_eal/common/eal_common_launch.c
> @@ -21,8 +21,10 @@
>  int
>  rte_eal_wait_lcore(unsigned slave_id)
>  {
> -       if (lcore_config[slave_id].state == WAIT)
> +       if (lcore_config[slave_id].state == WAIT) {
> +               rte_rmb();
>                 return 0;
> +       }
>  
>         while (lcore_config[slave_id].state != WAIT &&
>                lcore_config[slave_id].state != FINISHED)
> -----
>
> This is because in lib/librte_eal/linux/eal/eal_thread.c:
>
> -----
> 		/* when a service core returns, it should go directly to WAIT
> 		 * state, because the application will not lcore_wait() for it.
> 		 */
> 		if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> 			lcore_config[lcore_id].state = WAIT;
> 		else
> 			lcore_config[lcore_id].state = FINISHED;
> -----
>
> NOTE that the service core skips the rte_eal_wait_lcore() code from
> making the FINISHED->WAIT transition.  So I think at least that read
> barrier will be needed (maybe I miss the pairing, though?).
>
> Additionally, I'm wondering if there is an additional write or sync
> barrier needed to ensure that some of the transitions are properly
> recorded when using lcore as a service lcore function.  The fact that
> this only happens occasionally tells me that it's either a race (which
> is possible... because the variable update in the test might not be
> sync'd across cores or something), or some other missing
> synchronization.
>
>> $ meson test service_autotest --repeat 50
>>
>> 1/1 DPDK:fast-tests / service_autotest      OK       3.86 s
>> 1/1 DPDK:fast-tests / service_autotest      OK       3.87 s
>> ...
>> 1/1 DPDK:fast-tests / service_autotest      OK       3.84 s
>>
>> OK:        50
>> FAIL:       0
>> SKIP:       0
>> TIMEOUT:    0
>>
>> I'll keep it running for a few hours but I have little faith if it only
>> takes 1 minute on your machines...
>
> Please try the flat command.

Not sure if you've had any time to look at this.

I think there's a change we can make, but not sure about how it fits in
the overall service lcore design.

The proposal is to use a pthread_cond variable which blocks the thread
requesting the service function to run.  The service function merely
sets the condition.  The requesting thread does a timed wait (up to 5s?)
and if the timeout is exceeded can throw an error.  Otherwise, it will
unblock and can assume that the test passes.  WDYT?  I think it works
better than the racy code in the test case for now.

>> Regards, -Harry

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-10-14 14:53             ` Aaron Conole
@ 2019-10-14 16:48               ` Van Haaren, Harry
  2019-10-15 16:42                 ` Van Haaren, Harry
  0 siblings, 1 reply; 11+ messages in thread
From: Van Haaren, Harry @ 2019-10-14 16:48 UTC (permalink / raw)
  To: Aaron Conole; +Cc: David Marchand, dev

> -----Original Message-----
> From: Aaron Conole [mailto:aconole@redhat.com]
> Sent: Monday, October 14, 2019 3:54 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: David Marchand <david.marchand@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> service_autotest failing
> 
> Aaron Conole <aconole@redhat.com> writes:
> 
> > "Van Haaren, Harry" <harry.van.haaren@intel.com> writes:
> >
> >>> -----Original Message-----
> >>> From: Aaron Conole [mailto:aconole@redhat.com]
> >>> Sent: Wednesday, September 4, 2019 8:56 PM
> >>> To: David Marchand <david.marchand@redhat.com>
> >>> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> >>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> service_autotest
> >>> failing
<snip lots of backlog>
> >>> > real    2m42.884s
> >>> > user    5m1.902s
> >>> > sys    0m2.208s
> >>>
> >>> I can confirm - takes about 1m to fail.
> >>
> >>
> >> Hi Aaron and David,
> >>
> >> I've been attempting to reproduce this, still no errors here.
> >>
> >> Given the nature of service-cores, and the difficulty to reproduce
> >> here this feels like a race-condition - one that may not exist in all
> >> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
> >>
> >> I'm using Meson to build, so reproducing using this instead of the
> command
> >> as provided above. There should be no difference in reproducing due to
> this:
> >
> > The command runs far more iterations than meson does (I think).
> >
> > I still see it periodically occur in the travis environment.
> >
> > I did see at least one missing memory barrier (I believe).  Please
> > review the following code change (and if you agree I can submit it
> > formally):
> >
> > -----
> > --- a/lib/librte_eal/common/eal_common_launch.c
> > +++ b/lib/librte_eal/common/eal_common_launch.c
> > @@ -21,8 +21,10 @@
> >  int
> >  rte_eal_wait_lcore(unsigned slave_id)
> >  {
> > -       if (lcore_config[slave_id].state == WAIT)
> > +       if (lcore_config[slave_id].state == WAIT) {
> > +               rte_rmb();
> >                 return 0;
> > +       }
> >
> >         while (lcore_config[slave_id].state != WAIT &&
> >                lcore_config[slave_id].state != FINISHED)
> > -----
> >
> > This is because in lib/librte_eal/linux/eal/eal_thread.c:
> >
> > -----
> > 		/* when a service core returns, it should go directly to WAIT
> > 		 * state, because the application will not lcore_wait() for it.
> > 		 */
> > 		if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> > 			lcore_config[lcore_id].state = WAIT;
> > 		else
> > 			lcore_config[lcore_id].state = FINISHED;
> > -----
> >
> > NOTE that the service core skips the rte_eal_wait_lcore() code from
> > making the FINISHED->WAIT transition.  So I think at least that read
> > barrier will be needed (maybe I miss the pairing, though?).
> >
> > Additionally, I'm wondering if there is an additional write or sync
> > barrier needed to ensure that some of the transitions are properly
> > recorded when using lcore as a service lcore function.  The fact that
> > this only happens occasionally tells me that it's either a race (which
> > is possible... because the variable update in the test might not be
> > sync'd across cores or something), or some other missing
> > synchronization.
> >
> >> $ meson test service_autotest --repeat 50
> >>
> >> 1/1 DPDK:fast-tests / service_autotest      OK       3.86 s
> >> 1/1 DPDK:fast-tests / service_autotest      OK       3.87 s
> >> ...
> >> 1/1 DPDK:fast-tests / service_autotest      OK       3.84 s
> >>
> >> OK:        50
> >> FAIL:       0
> >> SKIP:       0
> >> TIMEOUT:    0
> >>
> >> I'll keep it running for a few hours but I have little faith if it only
> >> takes 1 minute on your machines...
> >
> > Please try the flat command.
> 
> Not sure if you've had any time to look at this.

Apologies for delay in response - I've ran the existing tests a few 1000's of times during the week, with one reproduction. That's not enough for confidence in debug/fix for me.


> I think there's a change we can make, but not sure about how it fits in
> the overall service lcore design.

This suggestion is only changing the test code correct?


> The proposal is to use a pthread_cond variable which blocks the thread
> requesting the service function to run.  The service function merely
> sets the condition.  The requesting thread does a timed wait (up to 5s?)
> and if the timeout is exceeded can throw an error.  Otherwise, it will
> unblock and can assume that the test passes.  WDYT?  I think it works
> better than the racy code in the test case for now.

The idea/concept is right above, but I think that's what the test is
approximating anyway? The main thread does an "mp_wait_lcore()" until
the service core has returned, essentially a blocking call.

The test fails if the flag is not == 1 (as that indidcates failure in launching
an application function on a previously-use-as-service-core lthread).

I think your RMB suggestion is likely to be the correct, but I'd like to dig into it a bit more.

Thanks for the ping on this thread.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
  2019-10-14 16:48               ` Van Haaren, Harry
@ 2019-10-15 16:42                 ` Van Haaren, Harry
  0 siblings, 0 replies; 11+ messages in thread
From: Van Haaren, Harry @ 2019-10-15 16:42 UTC (permalink / raw)
  To: Van Haaren, Harry, Aaron Conole; +Cc: David Marchand, dev

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Van Haaren, Harry
> Sent: Monday, October 14, 2019 5:49 PM
> To: Aaron Conole <aconole@redhat.com>
> Cc: David Marchand <david.marchand@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest
> failing
> 
> > -----Original Message-----
> > From: Aaron Conole [mailto:aconole@redhat.com]
> > Sent: Monday, October 14, 2019 3:54 PM
> > To: Van Haaren, Harry <harry.van.haaren@intel.com>
> > Cc: David Marchand <david.marchand@redhat.com>; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> > service_autotest failing
> >
> > Aaron Conole <aconole@redhat.com> writes:
> >
> > > "Van Haaren, Harry" <harry.van.haaren@intel.com> writes:
> > >
> > >>> -----Original Message-----
> > >>> From: Aaron Conole [mailto:aconole@redhat.com]
> > >>> Sent: Wednesday, September 4, 2019 8:56 PM
> > >>> To: David Marchand <david.marchand@redhat.com>
> > >>> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> > >>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> > service_autotest
> > >>> failing
> <snip lots of backlog>
> > >>> > real    2m42.884s
> > >>> > user    5m1.902s
> > >>> > sys    0m2.208s
> > >>>
> > >>> I can confirm - takes about 1m to fail.
> > >>
> > >>
> > >> Hi Aaron and David,
> > >>
> > >> I've been attempting to reproduce this, still no errors here.
> > >>
> > >> Given the nature of service-cores, and the difficulty to reproduce
> > >> here this feels like a race-condition - one that may not exist in all
> > >> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
> > >>
> > >> I'm using Meson to build, so reproducing using this instead of the
> > command
> > >> as provided above. There should be no difference in reproducing due to
> > this:
> > >
> > > The command runs far more iterations than meson does (I think).
> > >
> > > I still see it periodically occur in the travis environment.
> > >
> > > I did see at least one missing memory barrier (I believe).  Please
> > > review the following code change (and if you agree I can submit it
> > > formally):
> > >
> > > -----
> > > --- a/lib/librte_eal/common/eal_common_launch.c
> > > +++ b/lib/librte_eal/common/eal_common_launch.c
> > > @@ -21,8 +21,10 @@
> > >  int
> > >  rte_eal_wait_lcore(unsigned slave_id)
> > >  {
> > > -       if (lcore_config[slave_id].state == WAIT)
> > > +       if (lcore_config[slave_id].state == WAIT) {
> > > +               rte_rmb();
> > >                 return 0;
> > > +       }
> > >
> > >         while (lcore_config[slave_id].state != WAIT &&
> > >                lcore_config[slave_id].state != FINISHED)
> > > -----
> > >
> > > This is because in lib/librte_eal/linux/eal/eal_thread.c:
> > >
> > > -----
> > > 		/* when a service core returns, it should go directly to WAIT
> > > 		 * state, because the application will not lcore_wait() for it.
> > > 		 */
> > > 		if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> > > 			lcore_config[lcore_id].state = WAIT;
> > > 		else
> > > 			lcore_config[lcore_id].state = FINISHED;
> > > -----
> > >
> > > NOTE that the service core skips the rte_eal_wait_lcore() code from
> > > making the FINISHED->WAIT transition.  So I think at least that read
> > > barrier will be needed (maybe I miss the pairing, though?).
> > >
> > > Additionally, I'm wondering if there is an additional write or sync
> > > barrier needed to ensure that some of the transitions are properly
> > > recorded when using lcore as a service lcore function.  The fact that
> > > this only happens occasionally tells me that it's either a race (which
> > > is possible... because the variable update in the test might not be
> > > sync'd across cores or something), or some other missing
> > > synchronization.
> > >
> > >> $ meson test service_autotest --repeat 50
> > >>
> > >> 1/1 DPDK:fast-tests / service_autotest      OK       3.86 s
> > >> 1/1 DPDK:fast-tests / service_autotest      OK       3.87 s
> > >> ...
> > >> 1/1 DPDK:fast-tests / service_autotest      OK       3.84 s
> > >>
> > >> OK:        50
> > >> FAIL:       0
> > >> SKIP:       0
> > >> TIMEOUT:    0
> > >>
> > >> I'll keep it running for a few hours but I have little faith if it only
> > >> takes 1 minute on your machines...
> > >
> > > Please try the flat command.
> >
> > Not sure if you've had any time to look at this.
> 
> Apologies for delay in response - I've ran the existing tests a few 1000's of
> times during the week, with one reproduction. That's not enough for confidence
> in debug/fix for me.
> 
> 
> > I think there's a change we can make, but not sure about how it fits in
> > the overall service lcore design.
> 
> This suggestion is only changing the test code correct?
> 
> 
> > The proposal is to use a pthread_cond variable which blocks the thread
> > requesting the service function to run.  The service function merely
> > sets the condition.  The requesting thread does a timed wait (up to 5s?)
> > and if the timeout is exceeded can throw an error.  Otherwise, it will
> > unblock and can assume that the test passes.  WDYT?  I think it works
> > better than the racy code in the test case for now.
> 
> The idea/concept is right above, but I think that's what the test is
> approximating anyway? The main thread does an "mp_wait_lcore()" until
> the service core has returned, essentially a blocking call.
> 
> The test fails if the flag is not == 1 (as that indidcates failure in
> launching
> an application function on a previously-use-as-service-core lthread).
> 
> I think your RMB suggestion is likely to be the correct, but I'd like to dig
> into it a bit more.
> 
> Thanks for the ping on this thread.

Good news - adding a rte_delay_ms() to the start of service_remote_launch_func()
makes this 100% (so far) reproducible here. So yes it's a race condition,
and I think I have a handle on why/what - it's an (lcore.state == WAIT) race.

To be continued...

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-10-15 16:43 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-03 14:45 [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing Aaron Conole
2019-09-04  9:41 ` Van Haaren, Harry
2019-09-04 10:04   ` David Marchand
2019-09-04 10:38     ` David Marchand
2019-09-04 19:56       ` Aaron Conole
2019-10-07  9:50         ` Van Haaren, Harry
2019-10-07 12:38           ` Aaron Conole
2019-10-14 14:53             ` Aaron Conole
2019-10-14 16:48               ` Van Haaren, Harry
2019-10-15 16:42                 ` Van Haaren, Harry
2019-09-04  9:55 ` David Marchand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).