On Wed, Aug 16, 2023 at 3:26 PM David Marchand <david.marchand@redhat.com> wrote:
On Wed, Aug 16, 2023 at 8:30 PM Patrick Robb <probb@iol.unh.edu> wrote:
> On Wed, Aug 16, 2023 at 10:40 AM David Marchand <david.marchand@redhat.com> wrote:
>>
>> Patrick, Bruce,
>>
>> If it was reported, I either missed it or forgot about it, sorry.
>> Can you (re)share the context?
>>
>>
>> >
>> > Does the test suite pass if the mlx5 driver is disabled in the build? That
>> > could confirm or refute the suspicion of where the issue is, and also
>> > provide a temporary workaround while this set is merged (possibly including
>> > support for disabling specific tests, as I suggested in my other email).
>>
>> Or disabling the driver as Bruce proposes.
>
> Okay, we ran the test with the mlx5 driver disabled (see the sketch below), and it still fails. So, this might be more of an ARM architecture issue. Ruifeng, are you still seeing this on your test bed?
>
> @David you didn't miss anything; we had an off-list (unicast) exchange with ARM when setting up the new ARM container runners for unit testing a few months back. Ruifeng also noticed the same issue and speculated about mlx5 memory leaks. He raised the possibility of disabling the mlx5 driver too, but that option isn't great, since we want the build process for our unit testing to be as uniform as possible. Anyway, now we know the driver isn't the culprit. I'll forward the thread to you in any case - let me know if you have any ideas.
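
(To be clear about what "disabled" meant above: the build was simply configured without the driver. Below is a minimal sketch of that kind of configure step, assuming DPDK's disable_drivers meson option; the build directory and invocation are illustrative, our CI wraps this in its own scripts.)

# Illustrative sketch only: configure and build DPDK without the mlx5 PMD.
# Run from a DPDK source tree; the build directory name is a placeholder.
import subprocess

BUILD_DIR = "build"  # placeholder build directory

# -Ddisable_drivers takes a comma-separated list of driver sub-paths.
subprocess.run(
    ["meson", "setup", BUILD_DIR, "-Ddisable_drivers=net/mlx5"],
    check=True,
)
subprocess.run(["ninja", "-C", BUILD_DIR], check=True)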

The mention of "memtest1" in the mails rings a bell.
I will need more detailed logs, or ideally an env where it is reproduced.

The meson-logs/ directory from the unit test run (with eal_flags_file_prefix_autotest included) has been shared with you via Slack. I also shared the meson test summary, but of course it's the detailed testlog.txt you care about.
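
In case it helps reproduction, this is roughly what the run boils down to (a sketch only, assuming an already-configured build tree in ./build; the CI wraps this in its own scripts):

# Sketch: run the failing unit test by name and pull the detailed log.
# Assumes a configured DPDK build tree in ./build; adjust to your env.
import pathlib
import subprocess

BUILD_DIR = pathlib.Path("build")

# meson can run a single test by name; --print-errorlogs also dumps the
# failing test's output to the console.
subprocess.run(
    ["meson", "test", "-C", str(BUILD_DIR),
     "eal_flags_file_prefix_autotest", "--print-errorlogs"],
    check=False,  # keep going even on failure, the log is what we want
)

# The detailed per-test output is collected here:
testlog = BUILD_DIR / "meson-logs" / "testlog.txt"
print(testlog.read_text()[-4000:])  # tail of the log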

One thing bothers me... why are we not seeing this failure on ARM for
Bruce's v6 series?
Just looking at patchwork, I would think that I can merge Bruce's series as is.
https://patchwork.dpdk.org/project/dpdk/patch/20230816153439.551501-12-bruce.richardson@intel.com/

So, this is a niche edge case: because we failed to apply the fast-test filtering script in our Jenkinsfile, we exited without doing any unit testing and didn't save or report any results. Almost always, when we fail during the "UNH Jenkins script" stage, it's an infra failure rather than a problem with the patch, and we don't want to report a false-positive failure in that case. It does further exemplify the danger in our current process, of course. I'll be glad to not have to do this anymore. I did try to make this point above, but I don't think I explained it very well.

The only other thing I'll add is that we are going to change our reporting process soon: we will begin our pipeline run on a test/environment combo by reporting a "pending" result for that test/environment, then overwrite it with a PASS or FAIL at the end. This helps protect us from situations like this one. For instance, the way this would have played out is that you would have had a label (iol-unit-arm64-testing) with an initial "PENDING" result reported to it, and it never would have been updated from pending. So you would have known the CI results were incomplete.
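
To make that concrete, the flow would look roughly like the sketch below. This is illustrative only; it assumes patchwork's checks REST endpoint, and the patch ID, token and messages are placeholders rather than our actual reporting code.

# Sketch of the "pending first, then overwrite" reporting flow.
# Endpoint shape follows patchwork's checks REST API; IDs/token are fake.
import requests

PATCHWORK_API = "https://patchwork.dpdk.org/api"
PATCH_ID = 123456          # placeholder patch ID
TOKEN = "REDACTED"         # placeholder API token
CONTEXT = "iol-unit-arm64-testing"

def post_check(state, description):
    """POST a check (state is one of: pending, success, warning, fail)."""
    resp = requests.post(
        f"{PATCHWORK_API}/patches/{PATCH_ID}/checks/",
        headers={"Authorization": f"Token {TOKEN}"},
        json={"state": state, "context": CONTEXT, "description": description},
    )
    resp.raise_for_status()

# 1. Before any testing starts, mark the environment as pending.
post_check("pending", "Unit testing queued on the ARM64 container runner")

# 2. ... run the unit tests ...

# 3. Overwrite with the final result. If the pipeline dies before this
#    step, the check stays at "pending", signalling incomplete results.
post_check("success", "Unit testing PASS")   # or "fail" on a real failure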