On Wed, Aug 16, 2023 at 3:26 PM David Marchand wrote:

> On Wed, Aug 16, 2023 at 8:30 PM Patrick Robb wrote:
> > On Wed, Aug 16, 2023 at 10:40 AM David Marchand <david.marchand@redhat.com> wrote:
> >>
> >> Patrick, Bruce,
> >>
> >> If it was reported, I either missed it or forgot about it, sorry.
> >> Can you (re)share the context?
> >>
> >> > Does the test suite pass if the mlx5 driver is disabled in the build? That
> >> > could confirm or refute the suspicion of where the issue is, and also
> >> > provide a temporary workaround while this set is merged (possibly including
> >> > support for disabling specific tests, as I suggested in my other email).
> >>
> >> Or disabling the driver as Bruce proposes.
> >
> > Okay, we ran the test with the mlx5 driver disabled, and it still fails.
> > So, this might be more of an ARM architecture issue. Ruifeng, are you
> > still seeing this on your test bed?
> >
> > @David, you didn't miss anything; we had a unicast exchange with ARM when
> > setting up the new ARM container runners for unit testing a few months
> > back. Ruifeng also noticed the same issue and speculated about mlx5
> > memory leaks. He raised the possibility of disabling the mlx5 driver too,
> > but that option isn't great since we want a uniform build process (as
> > much as possible) for our unit testing. Anyway, now we know that isn't
> > relevant. I'll forward the thread to you in any case - let me know if you
> > have any ideas.
>
> The mention of "memtest1" in the mails rings a bell.
> I will need more detailed logs, or ideally an env where it is reproduced.

The meson-logs/ for the unit test run with eal_flags_file_prefix_autotest
included is shared with you via Slack. I also shared the meson test summary,
but of course it's the detailed testlog.txt you care about.

> One thing bothers me.. why are we not seeing this failure with ARM for
> Bruce v6 series?
> Just looking at patchwork, I would think that I can merge Bruce series as
> is.
>
> https://patchwork.dpdk.org/project/dpdk/patch/20230816153439.551501-12-bruce.richardson@intel.com/

So, this is a niche edge case, but because we fail to apply the fast-test
filtering script in our Jenkinsfile, we exit without doing any unit testing
and don't save or report any results. Almost always, if we fail during "UNH
Jenkins script" steps, it's an infra failure, not a problem with the patch,
and we don't want to report a false-positive failure there. It does further
illustrate the danger of our current process, of course. I'll be glad to not
have to do this anymore. I did try to make this point above, but I don't
think I explained it very well.

The only other thing I'll add is that we are going to change our reporting
process soon: we will begin our pipeline run for a test/environment combo by
reporting a "pending" result for that test/environment, then overwrite it
with a PASS or FAIL at the end. This helps protect us from situations like
this one. For instance, the way this would have played out is that you would
have had a label (iol-unit-arm64-testing) with an initial "PENDING" result
reported to it, which never would have been updated from pending. So, you
would know the CI results were incomplete.
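
For reference, here is a rough sketch of how the mlx5-disabled run can be
reproduced locally. This is illustrative only - the exact CI invocation
differs, and the disable_drivers list below is an assumption on my part:

    # build with the mlx5 driver left out of the configuration
    meson setup build -Ddisable_drivers=net/mlx5,common/mlx5
    ninja -C build

    # run only the test case that is failing for us
    meson test -C build eal_flags_file_prefix_autotest

    # detailed output ends up in build/meson-logs/testlog.txt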
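
And to make the pending/overwrite idea concrete, here is a sketch of the
reporting flow - not our actual Jenkinsfile, just an illustration assuming
the standard Patchwork REST checks endpoint; the patch ID, token variable,
and descriptions are placeholders:

    PATCH_ID=12345                       # placeholder
    CONTEXT=iol-unit-arm64-testing
    API="https://patchwork.dpdk.org/api/patches/${PATCH_ID}/checks/"

    # 1) reported as soon as the pipeline picks up the patch
    curl -X POST "$API" -H "Authorization: Token ${PW_TOKEN}" \
        -d context="$CONTEXT" -d state=pending \
        -d description="Unit testing in progress"

    # 2) posted at the end of the run; a second check with the same context
    #    supersedes the pending one. If the pipeline dies first, the check
    #    simply stays "pending", so an incomplete run is visible instead of
    #    silently missing.
    curl -X POST "$API" -H "Authorization: Token ${PW_TOKEN}" \
        -d context="$CONTEXT" -d state=success \
        -d description="Unit testing PASS"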