DPDK CI discussions
* Re: [PATCH v2] service: fix deadlock on worker lcore exit
From: David Marchand @ 2024-10-11  8:50 UTC
  To: Van Haaren, Harry, ci
  Cc: Mattias Rönnblom, dev, stephen, suanmingm, thomas, stable,
	Tyler Retzlaff, Aaron Conole

On Thu, Oct 3, 2024 at 5:50 PM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
> > From: David Marchand <david.marchand@redhat.com>
> > Sent: Thursday, October 3, 2024 10:13 AM
> > To: Mattias Rönnblom <mattias.ronnblom@ericsson.com>; Van Haaren, Harry <harry.van.haaren@intel.com>
> > Cc: dev@dpdk.org <dev@dpdk.org>; stephen@networkplumber.org <stephen@networkplumber.org>; suanmingm@nvidia.com <suanmingm@nvidia.com>; thomas@monjalon.net <thomas@monjalon.net>; stable@dpdk.org <stable@dpdk.org>; Tyler Retzlaff <roretzla@linux.microsoft.com>; Aaron Conole <aconole@redhat.com>
> > Subject: Re: [PATCH v2] service: fix deadlock on worker lcore exit
> >
> > On Thu, Oct 3, 2024 at 8:57 AM David Marchand <david.marchand@redhat.com> wrote:
> > >
> > > From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > >
> > > Calling rte_exit() from a worker lcore thread causes a deadlock in
> > > rte_service_finalize().
> > >
> > > This patch makes rte_service_finalize() deadlock-free by avoiding the
> > > need to synchronize with service lcore threads, which in turn is
> > > achieved by moving service and per-lcore state from the heap to being
> > > statically allocated.
> > >
> > > The BSS segment increases by ~156 kB (on x86_64 with default
> > > RTE_MAX_LCORE and RTE_SERVICE_NUM_MAX).
> > >
> > > According to the service perf autotest, this change also results in a
> > > slight reduction of service framework overhead.
> > >
> > > Fixes: 33666b448f15 ("service: fix crash on exit")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
> > > ---
> > > Changes since v1:
> > > - rebased,
> >
> > I can't merge this patch in its current state.
> >
> > At the moment, two CI systems report a problem with the
> > eal_flags_file_prefix_autotest unit test.
> >
> > -------------------------------------stdout-------------------------------------
> > RTE>>eal_flags_file_prefix_autotest
> > Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
> > '--proc-type=secondary' '-m' '18' '--file-prefix=memtest'
> > Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
> > '-m' '18' '--file-prefix=memtest1'
> > Error - hugepage files for memtest1 were not deleted!
> > Test Failed
> > RTE>>
> >
> > Can you have a look?
>
> Not sure how the code change in question relates to the eal-flags failure, but I can reproduce the failure here.
> The issue reproduces on *all* of the tags below; this indicates it's likely a board-config issue, and not a true issue (unless it's been there since 23.11??).
>
> Tested commits were all bad:
> b3485f4293 (HEAD, tag: v24.07) version: 24.07.0
> a9778aad62 (HEAD, tag: v24.03) version: 24.03.0
> eeb0605f11 (HEAD, tag: v23.11) version: 23.11.0
>
> So I'm pretty sure this is a board/runner config issue, with the error output as follows here:
> RTE>>eal_flags_file_prefix_autotest
> Running binary with argv[]:'./app/test/dpdk-test' '--proc-type=secondary' '-m' '18' '--file-prefix=memtest'
> EAL: Detected CPU lcores: 64
> EAL: Detected NUMA nodes: 2
> EAL: Detected static linkage of DPDK
> EAL: Cannot open '/var/run/dpdk/memtest/config' for rte_mem_config
> EAL: FATAL: Cannot init config
> EAL: Cannot init config
>
> FAIL:
> DPDK_TEST=eal_flags_file_prefix_autotest ./app/test/dpdk-test  --no-pci
>
> PASS:
> DPDK_TEST=eal_flags_file_prefix_autotest ./app/test/dpdk-test
>
> So it seems the eal-flags test is NOT able to handle args like "--no-pci"? I tend to run tests in no-PCI mode to speed things up :)

Well, speeding up, or hiding the issue, I guess.

> In short, this service-cores patch is not the root cause. Perhaps some of the CI folks can confirm whether extra args are passed to the runner?

To be clear, I can't merge this patch because of this (systematic)
failure in many CI environments (GHA, LoongArch, UNH).

Adding the CI mailing list to the loop.


-- 
David Marchand
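
For anyone on the CI side who wants to check the "hugepage files for
memtest1 were not deleted" condition by hand, here is a minimal sketch in
Python. It is not the logic of the unit test itself. It assumes (this is
an assumption, not something stated in this thread) that EAL names its
hugepage backing files "<file-prefix>map_<N>" inside the hugetlbfs mount
directory. The helper name leftover_hugepage_files and the scratch
directory used in the demo are hypothetical.

```python
import os
import tempfile


def leftover_hugepage_files(hugedir, file_prefix):
    """Return sorted file names in hugedir that look like hugepage files
    belonging to file_prefix.

    Assumption: files are named '<file_prefix>map_<N>'.
    """
    wanted = file_prefix + "map_"
    return sorted(name for name in os.listdir(hugedir)
                  if name.startswith(wanted))


if __name__ == "__main__":
    # Demo on a scratch directory, not on a real hugetlbfs mount.
    with tempfile.TemporaryDirectory() as scratch:
        for name in ("memtest1map_0", "memtest1map_1",
                     "memtestmap_0", "rtemap_0"):
            open(os.path.join(scratch, name), "w").close()
        # Only the two 'memtest1' files match; 'memtest' and 'rte' do not.
        print(leftover_hugepage_files(scratch, "memtest1"))
        # prints ['memtest1map_0', 'memtest1map_1']
```

On a real runner, hugedir would be the hugetlbfs mount point in use, and
a non-empty result after the test process exits would correspond to the
error reported above. Comparing the result after a run with and without
"--no-pci" may help narrow down whether the extra argument is involved.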

