On Wed, Aug 16, 2023 at 10:40 AM David Marchand <david.marchand@redhat.com> wrote:
Patrick, Bruce,

If it was reported, I either missed it or forgot about it, sorry.
Can you (re)share the context?

>
> Does the test suite pass if the mlx5 driver is disabled in the build? That
> could confirm or refute the suspicion of where the issue is, and also
> provide a temporary workaround while this set is merged (possibly including
> support for disabling specific tests, as I suggested in my other email).

Or disabling the driver as Bruce proposes.
 Okay, we ran the test with the mlx5 driver disabled, and it still fails. So, this might be more of an ARM architecture issue. Ruifeng, are you still seeing this on your test bed?

@David you didn't miss anything, we had a unicast with ARM when setting up the new arm container runners for unit testing a few months back. Ruifeng also noticed the same issue and speculated about mlx5 memory leaks. He raised the possibility of disabling the mlx5 driver too, but that option isn't great since we want to have a uniform build process (as much as possible) for our unit testing. Anyways, now we know that that isn't relevant. I'll forward the thread to you in any case - let me know if you have any ideas. 

>
> /Bruce
>
> PS: Are there any other workarounds inside the test/DTS/CI systems that
> involve patching sources? If so, it would be good to get a list that we can
> work through removing by putting place proper fixes or workarounds, as
> changing sources for testing like this blocks future patch acceptance. 

Patching sources from the test tool is a poor solution.
In general, developers won't be aware of source patching and will
waste time trying to understand why they can't reproduce what the CI
reports (it happened to me with DTS on the interrupt stuff with vhost,
at least).

For this specific case of skipping a test, if nobody can fix the
issue, I prefer if the CI can skip some "known broken in my lab" tests
via some meson configuration.
And, such configuration should be easy to catch in the test report.

I strongly agree on all points, which is why I said it was probably a good thing anyhow for us to lose this ability. In the case of the disabled fast-test for arm, that was a new discovery coming from adding new environments, not a regression introduced by a patch, and I don't think it made sense then to block the introduction of the entire unit test coverage for arm while they looked into this issue. If it's possible to introduce meson configure functionality to disable specific tests, that does give us more flexibility. And it's obviously a better process than us doing it at the CI end. 

We don't currently patch source in any other way in our CI testing.