Hi Andrew, I've attached the public key we'd like to use. Thanks, Adam On Thu, Oct 26, 2023 at 8:19 AM Andrew Rybchenko < andrew.rybchenko@oktetlabs.ru> wrote: > Hi Adam, > > On 10/25/23 23:27, Adam Hassick wrote: > > Hi Andrew, > > Sorry about the two week radio silence, we're still trying to sort out > logistics for deployment on our end. > > > I've created setup on ts-factory.io which allows to publish logs. Let > me know if you'd like to try it and I'll provide credentials, script and > short instruction. > > We're interested in publishing some test logs to the ts-factory Bublik > instance in the meantime. > > > Please, send me SSH public key which you'd like to use to upload logs. > I'll provide helper script and instructions to do it. > > Andrew. > > > Thanks, > Adam > > On Mon, Oct 23, 2023 at 7:11 AM Andrew Rybchenko < > andrew.rybchenko@oktetlabs.ru> wrote: > >> Hi Adam, >> >> > Now that our test results are in line with yours, we can begin looking >> into setting up the production environment. >> >> Please, let me know if you need any help with it or waiting for an input >> from me. >> >> Regards, >> Andrew. >> >> On 10/10/23 17:09, Adam Hassick wrote: >> >> Hi Andrew, >> >> Thank you for taking a look at our log. Netplan was attempting to run >> DHCP on our test links, and additionally I discovered that the NIC firmware >> was transmitting LLDP packets, causing tests to fail in the same way. Now >> that these problems have been fixed, our pass rate on the XL710 is >> approximately 91%. Now that our test results are in line with yours, we can >> begin looking into setting up the production environment. >> >> First, is it possible to run the test agent on ARM hosts? Our ARM >> testbeds have the best topology for running this test suite, with separate >> tester and DUT servers. >> >> We are testing this test suite on two x86 development servers using the >> test suite's recommended server topology. In contrast, our existing x86 >> production testbeds which run DTS have a single server topology. This >> single server has both the tester NIC and the device under test NIC >> installed, with NUMA node separation between TRex and DPDK. We're going to >> test running the two test agent processes on the single-server testbeds if >> we cannot run this on ARM. Is there any reason you can think of that would >> prevent this setup from working? >> >> Once we figure out where this can live in production, then we will begin >> setting up log storage, Jenkins integration, and Bublik. >> >> Thanks, >> Adam >> >> On Thu, Oct 5, 2023 at 6:25 AM Andrew Rybchenko < >> andrew.rybchenko@oktetlabs.ru> wrote: >> >>> Hi Adam, >>> >>> > Do these default to vfio-pci? >>> >>> Yes, vfio-pci is the default. >>> However, it does not work in the case of Mellanox which uses bifurcated >>> driver. It should mlx5_core for Mellanox NICs. >>> >>> > Here is the text log from a run on our Intel XL710 NICs, with the >>> expected result profile set to the X710. >>> >>> It is hard to analyze all tests using text logs, but I definitely see >>> one common problem. Tests receive unexpected packets and fail because of it. >>> Tests are written very strict from this point of view and it brought >>> fruits in the past when HW had bugs. >>> Are DUT and tester connected back-to-back on tested interfaces or via >>> switch? >>> If via switch, is it possible to isolate it from everything else? >>> If back-to-back, it could be some embedded SW/FW which regenerates these >>> packet. >>> I definitely see unexpected DHCP packets. 
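For anyone chasing the same symptom, a minimal sketch of how the two noise sources named at the top of the thread (netplan running DHCP on the test links, and XL710 firmware emitting LLDP) might be silenced. The netplan file name and interface name are placeholders, and the disable-fw-lldp private flag depends on the i40e driver version in use:

$ sudo tee /etc/netplan/90-dts-testlinks.yaml <<'EOF'
network:
  version: 2
  ethernets:
    enp59s0f0:          # tester-side test interface (placeholder)
      dhcp4: false
      dhcp6: false
EOF
$ sudo netplan apply
$ sudo ethtool --set-priv-flags enp59s0f0 disable-fw-lldp on   # stop firmware LLDP on i40e/XL710
$ ethtool --show-priv-flags enp59s0f0 | grep -i lldp           # confirm the flag took effect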
>>> >>> > We haven't set up the Jenkins integration yet, however if this is >>> required to import the logs then we will prioritize that. >>> >>> Unfortunately manual runs do not generate all artifacts required to >>> import logs. However, we have almost solved it right now. Hopefully we'll >>> finalize it in a day or two. I'll let you know when these changes are >>> available. >>> >>> Regards, >>> Andrew. >>> >>> On 10/4/23 16:48, Adam Hassick wrote: >>> >>> Hi Andrew, >>> >>> Ok, that makes sense. I don't see TE_ENV_H1/H2_DPDK_DRIVER set anywhere >>> in the default configurations for the Intel X710. Do these default to >>> vfio-pci? >>> We have IOMMU enabled on our development testbed, and should be able to >>> bind vfio-pci. >>> Here is the text log from a run on our Intel XL710 NICs, with the >>> expected result profile set to the X710. We haven't set up the Jenkins >>> integration yet, however if this is required to import the logs then we >>> will prioritize that. >>> log.txt.tar.gz >>> >>> >>> Thanks, >>> Adam >>> >>> On Mon, Sep 18, 2023 at 11:04 AM Andrew Rybchenko < >>> andrew.rybchenko@oktetlabs.ru> wrote: >>> >>>> On 9/18/23 17:44, Adam Hassick wrote: >>>> >>>> Hi Andrew and Konstantin, >>>> >>>> Thank you for adding the tester-dial feature, this opens up the >>>> possibility for us to do CI integrated testing in the future. >>>> >>>> Our Mellanox pass rate is similar to yours (about ~2400 passing, ~4400 >>>> failing), however our Intel pass rates are far worse. >>>> I will try running tests on the XL710 with the trc-tags argument set >>>> and see if it improves the pass rate. >>>> Another thing I noticed in the results you uploaded is that the results >>>> are tagged with vfio-pci and not i40e. >>>> Though in the environment dump, the driver on the test machine and the >>>> DUT are set to use the i40e driver. Is this important at all? >>>> >>>> >>>> I think it is a misunderstanding here. There are two kinds of driver in >>>> configuration: net driver and so-called DPDK driver. >>>> Net driver is a Linux kernel network device driver used on Tester side. >>>> DPDK driver is a Linux kernel driver to bind device to to use it with >>>> DPDK. So, it is NOT a driver inside DPDK (drivers/net/*). >>>> In the case of bifurcated driver (like mlx5_core) it is the same in >>>> both cases. >>>> In non-bifurcated case DPDK driver is some UIO driver(vfio-pci, >>>> uio-pci-generic or igb_uio). >>>> Some expectations depend on used UIO. For example, uio-pci-generic do >>>> not support many interrupts (used by usecases/rx_intr test cases). >>>> That's why we care corresponding TRC tag. >>>> >>>> TE_ENV_*_DPDK_DRIVER variables should be vfio-pc in 710's Intel case. >>>> Or uio-pci-generic if IOMMU is turned off on corresponding machines and >>>> Linux distro does not support VFIO no IOMMU mode. >>>> >>>> Andrew. >>>> >>>> There isn't anything preventing us from pushing our results up to the >>>> existing Bublik instance running at ts-factory.io that I can think of >>>> at the moment. >>>> We will have to work out how to submit our results to your Bublik >>>> instance in a controlled and secure manner in that case. >>>> As far as I know we won't need access controls for the results >>>> themselves. I'll discuss this with Patrick and will let you know once we >>>> confirm that it's fine. 
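As a side note on the vfio-pci binding discussed earlier in this message, a rough sketch of how one might confirm by hand that a port can be bound before pointing the suite at it. The PCI address is a placeholder, dpdk-devbind.py is the helper shipped in DPDK's usertools, and the TE_ENV_* names are the ones referenced above (normally set by the ts-rigs scripts rather than exported manually):

$ dmesg | grep -iE 'DMAR|IOMMU' | head          # check the IOMMU is actually active
$ sudo modprobe vfio-pci
$ sudo ./usertools/dpdk-devbind.py --bind=vfio-pci 0000:3b:00.0
$ ./usertools/dpdk-devbind.py --status-dev net
$ export TE_ENV_H1_DPDK_DRIVER=vfio-pci
$ export TE_ENV_H2_DPDK_DRIVER=vfio-pci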
>>>> >>>> Thanks, >>>> Adam >>>> >>>> On Mon, Sep 18, 2023 at 2:26 AM Andrew Rybchenko < >>>> andrew.rybchenko@oktetlabs.ru> wrote: >>>> >>>>> On 9/18/23 09:23, Konstantin Ushakov wrote: >>>>> >>>>> Hi Andrew, >>>>> >>>>> should we always auto-assign the tags or you don’t do it since it >>>>> slows down (by some seconds) the TE startup? >>>>> >>>>> >>>>> Tags are auto-assigned, but I guess it differs in Adam's case since >>>>> NIC is a bit different. Below test will help to understand if it is the >>>>> root cause of very different expectations. If pass rate will be close to >>>>> mine, I'll simply update TRC database to share expectations for mine NIC >>>>> and NIC used by Adam. >>>>> >>>>> Hi Adam, >>>>> >>>>> I think I second the question from Andrew - happy to help you with the >>>>> triage so that we get to the same baseline. Do you have a good way for us >>>>> to share the logs? I.e. say upload to ts-factory if we add strict >>>>> permissions system so it’s not publishing or any other way. >>>>> >>>>> Thanks, >>>>> Konstantin >>>>> >>>>> On 18 Sep 2023, at 9:15, Andrew Rybchenko wrote: >>>>> >>>>> Hi Adam, >>>>> >>>>> I've uploaded fresh testing results to ts-factory.io [1] to be on the >>>>> same page. >>>>> >>>>> I think I know why your and mine results on Intel 710 series NICs >>>>> differ so much. Testing results expectations database >>>>> (dpdk-ethdev-ts/trc/*) is filled in in terms of TRC tags. I.e. >>>>> expectations depends on TRC tags discovered by helper scripts when testing >>>>> is started. These tags identify various aspects of what is tested. Ideally >>>>> expectations should be written in terms of root cause of the expected >>>>> behaviour. If it is a driver expectations, driver tag should be used. If it >>>>> is HW limitation, tags with PCI IDs should be used. However, it is not >>>>> always easy to classify it correctly if you're not involved in driver >>>>> development. So, in order case expectations for 710's Intel are filled in >>>>> in terms of PCI IDs. I guess PCI ID differ in your case and that's why >>>>> expectations filled in for my NIC do not apply to your runs. >>>>> >>>>> Just try to add the following option when you run on your 710's Intel >>>>> in order to mimic mine and see if it helps to achieve better pass rate. >>>>> --trc-tag=pci-8086-1572 >>>>> >>>>> BTW, fresh TE tag v1.21.0 has improved algorithm to choose tests for >>>>> --tester-dial option. It should have better coverage now. >>>>> >>>>> Andrew. >>>>> >>>>> [1] >>>>> https://ts-factory.io/bublik/v2/runs?startDate=2023-09-16&finishDate=2023-09-16&runData=&runDataExpr=&page=1 >>>>> >>>>> On 9/13/23 18:45, Andrew Rybchenko wrote: >>>>> >>>>> Hi Adam, >>>>> >>>>> I've pushed new TE tag v1.20.0 which supported a new command-line >>>>> option --tester-dial=NUM where NUM is from 0 to 100. it allows to choose >>>>> percentage of tests to run. If you want stable set, you should pass >>>>> --tester-random-seed=0 (or other integer). It is the first sketch and we >>>>> have plans to improve it, but feedback would be welcome. >>>>> >>>>> > Is it needed on the tester? >>>>> >>>>> It is hard to say if it is strictly required for simple tests. >>>>> However, it is better to update Tester as well, since performance tests run >>>>> DPDK on Tester as well. >>>>> >>>>> > Are there any other manual setup steps for these devices that I >>>>> might be missing? >>>>> >>>>> I don't remember anything else. >>>>> >>>>> I think it is better to get down to details and take a look at logs. 
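A short usage sketch of the --tester-dial option introduced at the start of this message, with the rest of the run.sh invocation elided and an arbitrary percentage; fixing the seed keeps the selected subset stable between runs:

$ ./run.sh <usual arguments> --tester-dial=10 --tester-random-seed=0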
>>>>> I'm ready to help with it and explain what's happening there. May be it >>>>> will help to understand if it is a problem with setup/configuration. >>>>> >>>>> Text logs are not very convenient. Ideally logs should be imported to >>>>> bublik, however, manual runs do not provide all required artifacts right >>>>> now (Jenkins jobs generate all required artifacts). >>>>> Other option is 'tmp_raw_log' file (should be packed to make it >>>>> smaller) which could be converted to various log formats. >>>>> Would it be OK for you if I import your logs to bublik at >>>>> ts-factory.io? Or is it a problem that it is publicly available? >>>>> Would it help if we add authentication and access control there? >>>>> >>>>> Andrew. >>>>> >>>>> On 9/8/23 17:57, Adam Hassick wrote: >>>>> >>>>> Hi Andrew, >>>>> >>>>> I have a couple questions about needed setup of the NICs for the >>>>> ethdev test suite. >>>>> >>>>> Our MCX5s and XL710s are failing the checkup tests. The pass rate >>>>> appears to be much worse on the XL710s (40 of 73 tests failed, 3 passed >>>>> unexpectedly). >>>>> >>>>> For the XL710s, I've updated the driver and NVM versions to match the >>>>> minimum supported versions in the compatibility matrix found on the DPDK >>>>> documentation. This did not change the failure rate much. >>>>> For the MCX5s, I've installed the latest LTS version of the OFED >>>>> bifurcated driver on the DUT. Is it needed on the tester? >>>>> >>>>> Are there any other manual setup steps for these devices that I might >>>>> be missing? >>>>> >>>>> Thanks, >>>>> Adam >>>>> >>>>> On Wed, Sep 6, 2023 at 11:00 AM Adam Hassick >>>>> wrote: >>>>> >>>>>> Hi Andrew, >>>>>> >>>>>> Yes, I copied the X710 configs to set up XL710 configs. I changed the >>>>>> environment variable names from the X710 suffix to XL710 suffix in the >>>>>> script, and forgot to change them in the corresponding environment file. >>>>>> That fixed the issue. >>>>>> >>>>>> I got the checkup tests working on the XL710 now. Most of them are >>>>>> failing, which leads me to believe this is an issue with our testbed. Based >>>>>> on the DPDK documentation for i40e, the firmware and driver versions are >>>>>> much older than what DPDK 22.11 LTS and main prefer, so I'll try updating >>>>>> those. >>>>>> >>>>>> For now I'm working on getting the XL710 checkup tests passing, and >>>>>> will pick up getting the E810 configured properly next. I'll let you know >>>>>> if I run into any more issues in relation to the test engine. >>>>>> >>>>>> Thanks, >>>>>> Adam >>>>>> >>>>>> On Wed, Sep 6, 2023 at 7:36 AM Andrew Rybchenko < >>>>>> andrew.rybchenko@oktetlabs.ru> wrote: >>>>>> >>>>>>> Hi Adam, >>>>>>> >>>>>>> On 9/5/23 18:01, Adam Hassick wrote: >>>>>>> >>>>>>> Hi Andrew, >>>>>>> >>>>>>> The compilation warning issue is now resolved. Again, thank you guys >>>>>>> for fixing this for us. I can run the tests on the Mellanox CX5s again, >>>>>>> however I'm running into a couple new issues with running the prologues on >>>>>>> the Intel cards. >>>>>>> >>>>>>> When running testing on the Intel XL710s, I see this error appear in >>>>>>> the log: >>>>>>> >>>>>>> ERROR prologue Environment LIB 14:16:13.650 >>>>>>>> Too few networks in available configuration (0) in comparison with >>>>>>>> required (1) >>>>>>>> >>>>>>> >>>>>>> This seems like a trivial configuration error, perhaps this is >>>>>>> something I need to set up in ts-rigs. I briefly searched through the >>>>>>> examples there and didn't see any mention of how to set up a network. 
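Since the reply below traces this class of error to the condition variables that select the available networks, a quick sanity check before re-running is simply to dump what the rig scripts actually export:

$ env | grep '^TE_PCI_' | sort      # the TE_PCI_INSTANCE_* conditions named below should appear here
$ env | grep '^TE_ENV_' | sort      # TE v1.19.0 also logs the full TE_* set at startup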
>>>>>>> I will attach this log just in case you need more information. >>>>>>> >>>>>>> >>>>>>> Unfortunately logs are insufficient to understand it. I've pushed >>>>>>> new tag to TE v1.19.0 which add log message with TE_* environment variables. >>>>>>> Most likely something is wrong with variables which are used as >>>>>>> conditions when available networks are defined in >>>>>>> ts-conf/cs/inc.net_cfg_pci_fns.yml: >>>>>>> TE_PCI_INSTANCE_IUT_TST1 >>>>>>> TE_PCI_INSTANCE_IUT_TST1a >>>>>>> TE_PCI_INSTANCE_TST1a_IUT >>>>>>> TE_PCI_INSTANCE_TST1_IUT >>>>>>> My guess it that you change naming a bit, but script like >>>>>>> ts-rigs-sample/scripts/iut.h1-x710 is not included or not updated. >>>>>>> >>>>>>> There is a different error when running on the Intel E810s. It >>>>>>> appears to me like it starts DPDK, does some configuration inside DPDK and >>>>>>> on the device, and then fails to bring the device back up. Since this error >>>>>>> seems very non-trivial, I will also attach this log. >>>>>>> >>>>>>> >>>>>>> This one is a bit simpler. Few lines after the first ERROR in log I >>>>>>> see the following: >>>>>>> WARN RCF DPDK 13:06:00.144 >>>>>>> ice_program_hw_rx_queue(): currently package doesn't support RXDID >>>>>>> (22) >>>>>>> ice_rx_queue_start(): fail to program RX queue 0 >>>>>>> ice_dev_start(): fail to start Rx queue 0 >>>>>>> Device with port_id=0 already stopped >>>>>>> >>>>>>> It is stdout/stderr from test agent which runs DPDK. Same logs in >>>>>>> plain format are available in ta.DPDK file. >>>>>>> I'm not an expert here, but I vaguely remember that E810 requires >>>>>>> correct firmware and DDP to be loaded. >>>>>>> There is some information in dpdk/doc/guides/nics/ice.rst. >>>>>>> >>>>>>> You can try to add --dev-args=safe-mode-support=1 command-line >>>>>>> option described there. >>>>>>> >>>>>>> Hope it helps, >>>>>>> Andrew. >>>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> Adam >>>>>>> >>>>>>> On Fri, Sep 1, 2023 at 3:59 AM Andrew Rybchenko < >>>>>>> andrew.rybchenko@oktetlabs.ru> wrote: >>>>>>> >>>>>>>> Hi Adam, >>>>>>>> >>>>>>>> On 8/31/23 22:38, Adam Hassick wrote: >>>>>>>> >>>>>>>> Hi Andrew, >>>>>>>> >>>>>>>> I have one additional question as well: Does the test engine >>>>>>>> support running tests on two ARMv8 test agents? >>>>>>>> >>>>>>>> 1. We'll sort out warnings this week. Thanks for heads up. >>>>>>>>> >>>>>>>> >>>>>>>> Great. Let me know when that's fixed. >>>>>>>> >>>>>>>> >>>>>>>> Done. We also fixed a number of warnings in TE. >>>>>>>> Also we fixed root test package name to be consistent with the >>>>>>>> repository name. >>>>>>>> >>>>>>>> Support for old LTS branches was dropped some time ago, but in the >>>>>>>>> future it is definitely possible to keep it for new LTS branches. I think >>>>>>>>> 22.11 is supported, but I'm not sure about older LTS releases. >>>>>>>>> >>>>>>>> >>>>>>>> Good to know. >>>>>>>> >>>>>>>> >>>>>>>>> 2. You can add command-line option --sanity to run tests marked >>>>>>>>> with TEST_HARNESS_SANITY requirement (see dpdk-ethdev-ts/scripts/run.sh and >>>>>>>>> grep TEST_HARNESS_SANITY dpdk-ethdev-ts to see which tests are marked). >>>>>>>>> Yes, there is a space for terminology improvement here. We'll do it. >>>>>>>>> >>>>>>>> >>>>>>>> Done. Now it is called --checkup. >>>>>>>> >>>>>>>> >>>>>>>>> Also it takes a lot of time because of failures and tests which >>>>>>>>> wait for some timeout. >>>>>>>>> >>>>>>>> >>>>>>>> That makes sense to me. 
We'll use the time to complete tests on >>>>>>>> virtio or the Intel devices as a reference for how long the tests really >>>>>>>> take to complete. >>>>>>>> We will explore the possibility of periodically running the sanity >>>>>>>> tests for patches. >>>>>>>> >>>>>>>> >>>>>>>> I'll double-check and let you know how long entire TS runs on Intel >>>>>>>> X710, E810, Mellanox CX5 and virtio net. Just to ensure that time observed >>>>>>>> in your case looks the same. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> The test harness can provide coverage reports based on gcov, but >>>>>>>>> I'm not sure what you mean by a "dial" to control test coverage. Provided >>>>>>>>> reports are rather for human to analyze. >>>>>>>>> >>>>>>>> >>>>>>>> The general idea is to have some kind of parameter on the test >>>>>>>> suite, which could be an integer ranging from zero to ten, that controls >>>>>>>> how many tests are run based on how important the test is. >>>>>>>> >>>>>>>> Similar to how some command line interfaces provide a verbosity >>>>>>>> level parameter (some number of "-v" arguments) to control the importance >>>>>>>> of the information in the log. >>>>>>>> The verbosity level zero only prints very important log messages, >>>>>>>> while ten prints everything. >>>>>>>> >>>>>>>> In much the same manner as above, this "dial" parameter controls >>>>>>>> what tests are run and with what parameters based on how important those >>>>>>>> tests and test parameter combinations are. >>>>>>>> Coverage Level zero tells the suite to run a very basic set of >>>>>>>> important tests, with minimal parameterization. This mode would take only >>>>>>>> ~5-10 minutes to run. >>>>>>>> In contrast, Coverage Level ten includes all the edge cases, every >>>>>>>> combination of test parameters, everything the test suite can do, which >>>>>>>> takes the normal several hours to run. >>>>>>>> The values 1 - 9 are between those two extremes, allowing the user >>>>>>>> to get a gradient of test coverage in the results and to limit the running >>>>>>>> time. >>>>>>>> >>>>>>>> Then we could, for example, run the "run.sh" with a level of 2 or 3 >>>>>>>> for incoming patches that need quick results, and with a level of 10 for >>>>>>>> the less often run periodic tests performed on main or LTS branches. >>>>>>>> >>>>>>>> >>>>>>>> Understood now. Thanks a lot for the idea. We'll discuss it and >>>>>>>> come back. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> 3. Yes, really many tests on Mellanox CX5 NICs report unexpected >>>>>>>>> testing results. Unfortunately it is time consuming to fill in expectations >>>>>>>>> database since it is necessary to analyze testing results and classify if >>>>>>>>> it is a bug or just acceptable behaviour aspect. >>>>>>>>> >>>>>>>>> Bublik allows to compare results of two runs. It is useful for >>>>>>>>> human, but still not good for automation. >>>>>>>>> >>>>>>>>> I have local patch for mlx5 driver which reports Tx ring size >>>>>>>>> maximum. It makes pass rate higher. It is a problem for test harness that >>>>>>>>> mlx5 does not report limits right now. >>>>>>>>> >>>>>>>>> Pass rate on Intel X710 is about 92% on my test rig. Pass rate on >>>>>>>>> virtio net is 99% right now and could be done 100% easily (just one thing >>>>>>>>> to fix in expectations). >>>>>>>>> >>>>>>>>> I think logs storage setup is essential for logs analysis. 
Of >>>>>>>>> course, you can request HTML logs when you run tests (--log-html=html) or >>>>>>>>> generate after run using dpdk-ethdev-ts/scripts/html-log.sh and open >>>>>>>>> index.html in a browser, but logs storage makes it more convenient. >>>>>>>>> >>>>>>>> >>>>>>>> We are interested in setting up Bublik, potentially as an >>>>>>>> externally-facing component, once we have our process of running the test >>>>>>>> suite stabilized. >>>>>>>> Once we are able to run the test suite again, I'll see what the >>>>>>>> pass rate is on our other hardware. >>>>>>>> Good to know that it isn't an issue with our dev testbed causing >>>>>>>> the high fail rate. >>>>>>>> >>>>>>>> For Intel hardware, we have an XL710 and an Intel E810-C in our >>>>>>>> development testbed. Although they are slightly different devices, ideally >>>>>>>> the pass rate will be identical or similar. I have yet to set up a VM pair >>>>>>>> for virtio, but we will soon. >>>>>>>> >>>>>>>> Latest version of test-environment has examples of our CGI scripts >>>>>>>>> which we use for log storage (see tools/log_server/README.md). >>>>>>>>> >>>>>>>>> Also all bits for Jenkins setup are available. See >>>>>>>>> dpdk-ethdev-ts/jenkins/README.md and examples of jenkins files in >>>>>>>>> ts-rigs-sample. >>>>>>>>> >>>>>>>> >>>>>>>> Jenkins integration, setting up production rig configurations, and >>>>>>>> permanent log storage will be our next steps once I am able to run the >>>>>>>> tests again. >>>>>>>> Unless there is an easy way to have meson not pass "-Werror" into >>>>>>>> GCC. Then I would be able to run the test suite. >>>>>>>> >>>>>>>> >>>>>>>> Hopefully it is resolved now. >>>>>>>> >>>>>>>> I thought a bit more about your usecase for Jenkins. I'm not 100% >>>>>>>> sure that existing pipelines are convenient for your usecase. >>>>>>>> Fill free to ask questions when you are on it. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Andrew. >>>>>>>> >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Adam >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> On 8/29/23 17:02, Adam Hassick wrote: >>>>>>>>> >>>>>>>>> Hi Andrew, >>>>>>>>> >>>>>>>>> That fix seems to have resolved the issue, thanks for the quick >>>>>>>>> turnaround time on that patch. >>>>>>>>> Now that we have the RCF timeout issue resolved, there are a few >>>>>>>>> other questions and issues that we have about the tests themselves. >>>>>>>>> >>>>>>>>> 1. The test suite fails to build with a couple warnings. >>>>>>>>> >>>>>>>>> Below is the stderr log from compilation: >>>>>>>>> >>>>>>>>> FAILED: lib/76b5a35@@ts_dpdk_pmd@sta/dpdk_pmd_ts.c.o >>>>>>>>>> cc -Ilib/76b5a35@@ts_dpdk_pmd@sta -Ilib -I../../lib >>>>>>>>>> -I/opt/tsf/dpdk-ethdev-ts/ts/inst/default/include >>>>>>>>>> -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch >>>>>>>>>> -Werror -g -D_GNU_SOURCE -O0 -ggdb -Wall -W -fPIC -MD -MQ ' >>>>>>>>>> lib/76b5a35@@ts_dpdk_pmd@sta/dpdk_pmd_ts.c.o' -MF ' >>>>>>>>>> lib/76b5a35@@ts_dpdk_pmd@sta/dpdk_pmd_ts.c.o.d' -o ' >>>>>>>>>> lib/76b5a35@@ts_dpdk_pmd@sta/dpdk_pmd_ts.c.o' -c >>>>>>>>>> ../../lib/dpdk_pmd_ts.c >>>>>>>>>> ../../lib/dpdk_pmd_ts.c: In function >>>>>>>>>> ‘test_create_traffic_generator_params’: >>>>>>>>>> ../../lib/dpdk_pmd_ts.c:5577:5: error: format not a string >>>>>>>>>> literal and no format arguments [-Werror=format-security] >>>>>>>>>> 5577 | rc = te_kvpair_add(result, buf, mode); >>>>>>>>>> | ^~ >>>>>>>>>> cc1: all warnings being treated as errors >>>>>>>>>> ninja: build stopped: subcommand failed. >>>>>>>>>> ninja: Entering directory `.' 
>>>>>>>>>> FAILED: lib/76b5a35@@ts_dpdk_pmd@sta/dpdk_pmd_ts.c.o >>>>>>>>>> cc -Ilib/76b5a35@@ts_dpdk_pmd@sta -Ilib -I../../lib >>>>>>>>>> -I/opt/tsf/dpdk-ethdev-ts/ts/inst/default/include >>>>>>>>>> -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch >>>>>>>>>> -Werror -g -D_GNU_SOURCE -O0 -ggdb -Wall -W -fPIC -MD -MQ ' >>>>>>>>>> lib/76b5a35@@ts_dpdk_pmd@sta/dpdk_pmd_ts.c.o' -MF ' >>>>>>>>>> lib/76b5a35@@ts_dpdk_pmd@sta/dpdk_pmd_ts.c.o.d' -o ' >>>>>>>>>> lib/76b5a35@@ts_dpdk_pmd@sta/dpdk_pmd_ts.c.o' -c >>>>>>>>>> ../../lib/dpdk_pmd_ts.c >>>>>>>>>> ../../lib/dpdk_pmd_ts.c: In function >>>>>>>>>> ‘test_create_traffic_generator_params’: >>>>>>>>>> ../../lib/dpdk_pmd_ts.c:5577:5: error: format not a string >>>>>>>>>> literal and no format arguments [-Werror=format-security] >>>>>>>>>> 5577 | rc = te_kvpair_add(result, buf, mode); >>>>>>>>>> | ^~ >>>>>>>>>> cc1: all warnings being treated as errors >>>>>>>>>> >>>>>>>>> >>>>>>>>> This error wasn't occurring last week, which was the last time I >>>>>>>>> ran the tests. >>>>>>>>> The TE host and the DUT have GCC v9.4.0 installed, and the tester >>>>>>>>> has GCC v11.4.0 installed, if this information is helpful. >>>>>>>>> >>>>>>>>> 2. On the Mellanox CX5s, there are over 6,000 tests run, which >>>>>>>>> collectively take around 9 hours. Is it possible, and would it make sense, >>>>>>>>> to lower the test coverage and have the test suite run faster? >>>>>>>>> >>>>>>>>> For some context, we run immediate testing on incoming patches for >>>>>>>>> DPDK main and development branches, as well as periodic test runs on the >>>>>>>>> main, stable, and LTS branches. >>>>>>>>> For us to consider including this test suite as part of our >>>>>>>>> immediate testing on patches, we would have to reduce the test coverage to >>>>>>>>> the most important tests. >>>>>>>>> This is primarily to reduce the testing time to, for example, less >>>>>>>>> than 30 minutes. Testing on patches can't take too long because the lab can >>>>>>>>> receive numerous patches each day, which each require individual testing >>>>>>>>> runs. >>>>>>>>> >>>>>>>>> At what frequency we run these tests, and on what, still needs to >>>>>>>>> be discussed with the DPDK community, but it would be nice to know if the >>>>>>>>> test suite had a "dial" to control the testing coverage. >>>>>>>>> >>>>>>>>> 3. We see a lot of test failures on our Mellanox CX5 NICs. Around >>>>>>>>> 2,300 of ~6,600 tests passed. Is there anything we can do to diagnose these >>>>>>>>> test failures? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Adam >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Aug 29, 2023 at 8:07 AM Andrew Rybchenko < >>>>>>>>> andrew.rybchenko@oktetlabs.ru> wrote: >>>>>>>>> >>>>>>>>>> Hi Adam, >>>>>>>>>> >>>>>>>>>> I've pushed the fix in main branch and a new tag v1.18.1. It >>>>>>>>>> should solve the problem with IPv6 address from DNS. >>>>>>>>>> >>>>>>>>>> Andrew. >>>>>>>>>> >>>>>>>>>> On 8/29/23 00:05, Andrew Rybchenko wrote: >>>>>>>>>> >>>>>>>>>> Hi Adam, >>>>>>>>>> >>>>>>>>>> > Does the test engine prefer to use IPv6 over IPv4 for >>>>>>>>>> initiating the RCF connection to the test bed hosts? And if so, is there a >>>>>>>>>> way to force it to use IPv4? >>>>>>>>>> >>>>>>>>>> Brilliant idea. If DNS returns both IPv4 and IPv6 addresses in >>>>>>>>>> your case, I guess it is the root cause of the problem. >>>>>>>>>> Of course, it is TE problem since I see really weird code in >>>>>>>>>> lib/comm_net_engine/comm_net_engine.c line 135. 
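A quick way to confirm the dual-address theory, and a blunt workaround until the fix lands, sketched below; the host name is the one from the logs and the IPv4 address is a documentation-range placeholder:

$ getent ahosts iol-dts-tester.dpdklab.iol.unh.edu        # both A and AAAA answers here would explain the hang
$ echo '192.0.2.10 iol-dts-tester.dpdklab.iol.unh.edu' | sudo tee -a /etc/hosts   # pin IPv4 on the engine host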
>>>>>>>>>> >>>>>>>>>> I've pushed fix to the branch user/arybchik/fix_ipv4_only in >>>>>>>>>> ts-factory/test-environment repository. Please, try. >>>>>>>>>> >>>>>>>>>> It is late night fix with minimal testing and no review. I'll >>>>>>>>>> pass it through review process tomorrow and >>>>>>>>>> hopefully it will be released in one-two days. >>>>>>>>>> >>>>>>>>>> Andrew. >>>>>>>>>> >>>>>>>>>> On 8/28/23 18:02, Adam Hassick wrote: >>>>>>>>>> >>>>>>>>>> Hi Andrew, >>>>>>>>>> >>>>>>>>>> We have yet to notice a distinct pattern with the failures. >>>>>>>>>> Sometimes, the RCF will start and connect without issue a few times in a >>>>>>>>>> row before failing to connect again. Once the issue begins to occur, >>>>>>>>>> neither rebooting all of the hosts (test engine VM, tester, IUT) or >>>>>>>>>> deleting all of the build directories (suites, agents, inst) and rebooting >>>>>>>>>> the hosts afterward resolves the issue. When it begins working again seems >>>>>>>>>> very arbitrary to us. >>>>>>>>>> >>>>>>>>>> I do usually try to terminate the test engine with Ctrl+C, but >>>>>>>>>> when it hangs while trying to start RCF, that does not work. >>>>>>>>>> >>>>>>>>>> Does the test engine prefer to use IPv6 over IPv4 for initiating >>>>>>>>>> the RCF connection to the test bed hosts? And if so, is there a way to >>>>>>>>>> force it to use IPv4? >>>>>>>>>> >>>>>>>>>> - Adam >>>>>>>>>> >>>>>>>>>> On Fri, Aug 25, 2023 at 1:35 PM Andrew Rybchenko < >>>>>>>>>> andrew.rybchenko@oktetlabs.ru> wrote: >>>>>>>>>> >>>>>>>>>>> > I'll double-check test engine on Ubuntu 20.04 and Ubuntu 22.04. >>>>>>>>>>> >>>>>>>>>>> Done. It works fine for me without any issues. >>>>>>>>>>> >>>>>>>>>>> Have you noticed any pattern when it works or does not work? >>>>>>>>>>> May be it is a problem of not clean state after termination? >>>>>>>>>>> Does it work fine the first time after DUTs reboot? >>>>>>>>>>> How do you terminate testing? It should be done using Ctrl+C in >>>>>>>>>>> terminal where you execute run.sh command. >>>>>>>>>>> In this case it should shutdown gracefully and close all test >>>>>>>>>>> agents and engine applications. >>>>>>>>>>> >>>>>>>>>>> (I'm trying to understand why you've seen many test agent >>>>>>>>>>> processes. It should not happen.) >>>>>>>>>>> >>>>>>>>>>> Andrew. >>>>>>>>>>> >>>>>>>>>>> On 8/25/23 17:41, Andrew Rybchenko wrote: >>>>>>>>>>> >>>>>>>>>>> On 8/25/23 17:06, Adam Hassick wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Andrew, >>>>>>>>>>> >>>>>>>>>>> Two of our systems (the Test Engine runner and the DUT host) are >>>>>>>>>>> running Ubuntu 20.04 LTS, however this morning I noticed that the tester >>>>>>>>>>> system (the one having issues) is running Ubuntu 22.04 LTS. >>>>>>>>>>> This could be the source of the problem. I encountered a >>>>>>>>>>> dependency issue trying to run the Test Engine on 22.04 LTS, so I >>>>>>>>>>> downgraded the system. Since the tester is also the host having connection >>>>>>>>>>> issues, I will try downgrading that system to 20.04, and see if that >>>>>>>>>>> changes anything. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Unlikely, but who knows. We run tests (DUTs) on Ubuntu 20.04, >>>>>>>>>>> Ubuntu 22.04, Ubuntu 22.10, Ubuntu 23.04, Debian 11 and Fedora 38 every >>>>>>>>>>> night. >>>>>>>>>>> Right now Debian 11 is used for test engine in nightly >>>>>>>>>>> regressions. >>>>>>>>>>> >>>>>>>>>>> I'll double-check test engine on Ubuntu 20.04 and Ubuntu 22.04. 
>>>>>>>>>>> >>>>>>>>>>> I did try passing in the "--vg-rcf" argument to the run.sh >>>>>>>>>>> script of the test suite after installing valgrind, but there was no >>>>>>>>>>> additional output that I saw. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Sorry, I should valgrind output should be in valgrind.te_rcf >>>>>>>>>>> (direction where you run test engine). >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I will try pulling in the changes you've pushed up, and will see >>>>>>>>>>> if that fixes anything. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Adam >>>>>>>>>>> >>>>>>>>>>> On Fri, Aug 25, 2023 at 9:57 AM Andrew Rybchenko < >>>>>>>>>>> andrew.rybchenko@oktetlabs.ru> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hello Adam, >>>>>>>>>>>> >>>>>>>>>>>> On 8/24/23 23:54, Andrew Rybchenko wrote: >>>>>>>>>>>> >>>>>>>>>>>> I'd like to try to repeat the problem locally. Which Linux >>>>>>>>>>>> distro is running on test engine and agents? >>>>>>>>>>>> >>>>>>>>>>>> In fact I know one problem with Debian 12 and Fedora 38 and we >>>>>>>>>>>> have >>>>>>>>>>>> patch in review to fix it, however, the behaviour is different >>>>>>>>>>>> in >>>>>>>>>>>> this case, so it is unlike the same problem. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I've just published a new tag which fixes known test engine >>>>>>>>>>>> side problems on Debian 12 and Fedora 38. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> One more idea is to install valgrind on the test engine host and >>>>>>>>>>>> run with option --vg-rcf to check if something weird is >>>>>>>>>>>> happening. >>>>>>>>>>>> >>>>>>>>>>>> What I don't understand right now is why I see just one failed >>>>>>>>>>>> attempt >>>>>>>>>>>> to connect in your log.txt and then Logger shutdown after 9 >>>>>>>>>>>> minutes. >>>>>>>>>>>> >>>>>>>>>>>> Andrew. >>>>>>>>>>>> >>>>>>>>>>>> On 8/24/23 23:29, Adam Hassick wrote: >>>>>>>>>>>> >>>>>>>>>>>> > Is there any firewall in the network or on test hosts which >>>>>>>>>>>> could block incoming TCP connection to the port 23571 >>>>>>>>>>>> >>>>>>>>>>>> from the >>>>>>>>>>>> host where you run test engine? >>>>>>>>>>>> >>>>>>>>>>>> Our test engine host and the testbed are on the same subnet. >>>>>>>>>>>> The connection does work sometimes. >>>>>>>>>>>> >>>>>>>>>>>> > If behaviour the same on the next try and you see that test >>>>>>>>>>>> agent is kept running, could you check using >>>>>>>>>>>> > >>>>>>>>>>>> > # netstat -tnlp >>>>>>>>>>>> > >>>>>>>>>>>> > that Test Agent is listening on the port and try to >>>>>>>>>>>> establish TCP connection from test agent using >>>>>>>>>>>> > >>>>>>>>>>>> > $ telnet iol-dts-tester.dpdklab.iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> 23571 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> > >>>>>>>>>>>> > and check if TCP connection could be established. >>>>>>>>>>>> >>>>>>>>>>>> I was able to replicate the same behavior again, where it hangs >>>>>>>>>>>> while RCF is trying to start. >>>>>>>>>>>> Running this command, I see this in the output: >>>>>>>>>>>> >>>>>>>>>>>> tcp 0 0 0.0.0.0:23571 >>>>>>>>>>>> 0.0.0.0:* >>>>>>>>>>>> LISTEN 18599/ta >>>>>>>>>>>> >>>>>>>>>>>> So it seems like it is listening on the correct port. >>>>>>>>>>>> Additionally, I was able to connect to the Tester machine from >>>>>>>>>>>> our Test Engine host using telnet. It printed the PID of the process once >>>>>>>>>>>> the connection was opened. >>>>>>>>>>>> >>>>>>>>>>>> I tried running the "ta" application manually on the command >>>>>>>>>>>> line, and it didn't print anything at all. >>>>>>>>>>>> Maybe the issue is something on the Test Engine side. 
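If the engine side is suspect, the valgrind route mentioned earlier in this message is cheap to retry; a sketch, with the usual run.sh arguments elided and assuming valgrind is installed on the engine host:

$ sudo apt install valgrind
$ ./run.sh <usual arguments> --vg-rcf
$ less valgrind.te_rcf            # report lands in the directory run.sh was started from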
>>>>>>>>>>>> >>>>>>>>>>>> On Thu, Aug 24, 2023 at 2:35 PM Andrew Rybchenko < >>>>>>>>>>>> andrew.rybchenko@oktetlabs.ru >>>>>>>>>>>> >>>>>>>>>>>> > wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Adam, >>>>>>>>>>>> >>>>>>>>>>>> > On the tester host (which appears to be the Peer agent), >>>>>>>>>>>> there >>>>>>>>>>>> are four processes that I see running, which look like the >>>>>>>>>>>> test >>>>>>>>>>>> agent processes. >>>>>>>>>>>> >>>>>>>>>>>> Before the next try I'd recommend to kill these processes. >>>>>>>>>>>> >>>>>>>>>>>> Is there any firewall in the network or on test hosts which >>>>>>>>>>>> could >>>>>>>>>>>> block incoming TCP connection to the port 23571 >>>>>>>>>>>> >>>>>>>>>>>> from the host >>>>>>>>>>>> where you run test engine? >>>>>>>>>>>> >>>>>>>>>>>> If behaviour the same on the next try and you see that test >>>>>>>>>>>> agent is >>>>>>>>>>>> kept running, could you check using >>>>>>>>>>>> >>>>>>>>>>>> # netstat -tnlp >>>>>>>>>>>> >>>>>>>>>>>> that Test Agent is listening on the port and try to >>>>>>>>>>>> establish TCP >>>>>>>>>>>> connection from test agent using >>>>>>>>>>>> >>>>>>>>>>>> $ telnet iol-dts-tester.dpdklab.iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> 23571 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> and check if TCP connection could be established. >>>>>>>>>>>> >>>>>>>>>>>> Another idea is to login Tester under root as testing does, >>>>>>>>>>>> get >>>>>>>>>>>> start TA command from the log and try it by hands without >>>>>>>>>>>> -n and >>>>>>>>>>>> remove extra escaping. >>>>>>>>>>>> >>>>>>>>>>>> # sudo PATH=${PATH}:/tmp/linux_x86_root_76872_1692885663_1 >>>>>>>>>>>> >>>>>>>>>>>> LD_LIBRARY_PATH=${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/tmp/linux_x86_root_76872_1692885663_1 >>>>>>>>>>>> /tmp/linux_x86_root_76872_1692885663_1/ta Peer 23571 >>>>>>>>>>>> host=iol-dts-tester.dpdklab.iol.unh.edu: >>>>>>>>>>>> port=23571:user=root:key=/opt/tsf/keys/id_ed25519:ssh_port=22:copy_timeout=15:kill_timeout=15:sudo=:shell= >>>>>>>>>>>> >>>>>>>>>>>> Hopefully in this case test agent directory remains in the >>>>>>>>>>>> /tmp and >>>>>>>>>>>> you don't need to copy it as testing does. >>>>>>>>>>>> May be output could shed some light on what's going on. >>>>>>>>>>>> >>>>>>>>>>>> Andrew. >>>>>>>>>>>> >>>>>>>>>>>> On 8/24/23 17:30, Adam Hassick wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>> >>>>>>>>>>>> This is the output that I see in the terminal when this >>>>>>>>>>>> failure >>>>>>>>>>>> occurs, after the test agent binaries build and the test >>>>>>>>>>>> engine >>>>>>>>>>>> starts: >>>>>>>>>>>> >>>>>>>>>>>> Platform default build - pass >>>>>>>>>>>> Simple RCF consistency check succeeded >>>>>>>>>>>> --->>> Starting Logger...done >>>>>>>>>>>> --->>> Starting RCF...rcf_net_engine_connect(): Connection >>>>>>>>>>>> timed >>>>>>>>>>>> out iol-dts-tester.dpdklab.iol.unh.edu:23571 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Then, it hangs here until I kill the "te_rcf" and "te_tee" >>>>>>>>>>>> processes. I let it hang for around 9 minutes. >>>>>>>>>>>> >>>>>>>>>>>> On the tester host (which appears to be the Peer agent), >>>>>>>>>>>> there are >>>>>>>>>>>> four processes that I see running, which look like the test >>>>>>>>>>>> agent >>>>>>>>>>>> processes. >>>>>>>>>>>> >>>>>>>>>>>> ta.Peer is an empty file. I've attached the log.txt from >>>>>>>>>>>> this run. 
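Before retrying after a hang like this, it may also help to confirm nothing is left over from the previous attempt on the Tester; the path pattern below is illustrative, matching the /tmp agent directory shown in the quoted command above:

$ pgrep -af '/tmp/linux_x86.*/ta'       # any agent processes surviving from the last run?
$ sudo pkill -f '/tmp/linux_x86.*/ta'
$ ss -tnlp | grep 23571                 # nothing should be listening before a fresh start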
>>>>>>>>>>>> >>>>>>>>>>>> - Adam >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Aug 24, 2023 at 4:22��AM Andrew Rybchenko >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> > wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Adam, >>>>>>>>>>>> >>>>>>>>>>>> Yes, TE_RCFUNIX_TIMEOUT is in seconds. I've >>>>>>>>>>>> double-checked >>>>>>>>>>>> that it goes to 'copy_timeout' in ts-conf/rcf.conf. >>>>>>>>>>>> Description in in >>>>>>>>>>>> doc/sphinx/pages/group_te_engine_rcf.rst >>>>>>>>>>>> says that copy_timeout is in seconds and implementation >>>>>>>>>>>> in >>>>>>>>>>>> lib/rcfunix/rcfunix.c passes the value to select() >>>>>>>>>>>> tv_sec. >>>>>>>>>>>> Theoretically select() could be interrupted by signal, >>>>>>>>>>>> but I >>>>>>>>>>>> think it is unlikely here. >>>>>>>>>>>> >>>>>>>>>>>> I'm not sure that I understand what do you mean by RCF >>>>>>>>>>>> connection timeout. Does it happen on TE startup when >>>>>>>>>>>> RCF >>>>>>>>>>>> starts test agents. If so, TE_RCFUNIX_TIMEOUT could >>>>>>>>>>>> help. Or >>>>>>>>>>>> does it happen when tests are in progress, e.g. in the >>>>>>>>>>>> middle >>>>>>>>>>>> of a test. If so, TE_RCFUNIX_TIMEOUT is unrelated and >>>>>>>>>>>> most >>>>>>>>>>>> likely either host with test agent dies or test agent >>>>>>>>>>>> itself >>>>>>>>>>>> crashes. It would be easier for me if classify it if >>>>>>>>>>>> you share >>>>>>>>>>>> text log (log.txt, full or just corresponding fragment >>>>>>>>>>>> with >>>>>>>>>>>> some context). Also content of ta.DPDK or ta.Peer file >>>>>>>>>>>> depending on which agent has problems could shed some >>>>>>>>>>>> light. >>>>>>>>>>>> Corresponding files contain stdout/stderr of test >>>>>>>>>>>> agents. >>>>>>>>>>>> >>>>>>>>>>>> Andrew. >>>>>>>>>>>> >>>>>>>>>>>> On 8/23/23 17:45, Adam Hassick wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>> >>>>>>>>>>>> I've set up a test rig repository here, and have created >>>>>>>>>>>> configurations for our development testbed based off of >>>>>>>>>>>> the >>>>>>>>>>>> examples. >>>>>>>>>>>> We've been able to get the test suite to run manually on >>>>>>>>>>>> Mellanox CX5 devices once. >>>>>>>>>>>> However, we are running into an issue where, when RCF >>>>>>>>>>>> starts, >>>>>>>>>>>> the RCF connection times out very frequently. We aren't >>>>>>>>>>>> sure >>>>>>>>>>>> why this is the case. >>>>>>>>>>>> It works sometimes, but most of the time when we try to >>>>>>>>>>>> run >>>>>>>>>>>> the test engine, it encounters this issue. >>>>>>>>>>>> I've tried changing the RCF port by setting >>>>>>>>>>>> "TE_RCF_PORT=" and rebooting the >>>>>>>>>>>> testbed >>>>>>>>>>>> machines. Neither seems to fix the issue. >>>>>>>>>>>> >>>>>>>>>>>> It also seems like the timeout takes far longer than 60 >>>>>>>>>>>> seconds, even when running "export >>>>>>>>>>>> TE_RCFUNIX_TIMEOUT=60" >>>>>>>>>>>> before I try to run the test suite. >>>>>>>>>>>> I assume the unit for this variable is seconds? >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Adam >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Aug 21, 2023 at 10:19 AM Adam Hassick >>>>>>>>>>>> >>>>>>>>>>>> > wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>> >>>>>>>>>>>> Thanks, I've cloned the example repository and will >>>>>>>>>>>> start >>>>>>>>>>>> setting up a configuration for our development >>>>>>>>>>>> testbed >>>>>>>>>>>> today. I'll let you know if I run into any >>>>>>>>>>>> difficulties >>>>>>>>>>>> or have any questions. 
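For reference, the two knobs discussed in this exchange are ordinary environment variables read at startup; a sketch with an arbitrary port number, the timeout being in seconds and mapping to the rcfunix copy_timeout as explained above:

$ export TE_RCF_PORT=23580
$ export TE_RCFUNIX_TIMEOUT=60
$ ./run.sh <usual arguments>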
>>>>>>>>>>>> >>>>>>>>>>>> - Adam >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Aug 20, 2023 at 4:40 AM Andrew Rybchenko >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> > wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Adam, >>>>>>>>>>>> >>>>>>>>>>>> I've published >>>>>>>>>>>> https://github.com/ts-factory/ts-rigs-sample >>>>>>>>>>>> >>>>>>>>>>>> . >>>>>>>>>>>> Hopefully it will help to define your test rigs >>>>>>>>>>>> and >>>>>>>>>>>> successfully run some tests manually. Feel free >>>>>>>>>>>> to >>>>>>>>>>>> ask any questions and I'll answer here and try >>>>>>>>>>>> to >>>>>>>>>>>> update documentation. >>>>>>>>>>>> >>>>>>>>>>>> Meanwhile I'll prepare missing bits for steps >>>>>>>>>>>> (2) and >>>>>>>>>>>> (3). >>>>>>>>>>>> Hopefully everything is in place for step (4), >>>>>>>>>>>> but we >>>>>>>>>>>> need to make steps (2) and (3) first. >>>>>>>>>>>> >>>>>>>>>>>> Andrew. >>>>>>>>>>>> >>>>>>>>>>>> On 8/18/23 21:40, Andrew Rybchenko wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Adam, >>>>>>>>>>>> >>>>>>>>>>>> > I've conferred with the rest of the team, and >>>>>>>>>>>> we >>>>>>>>>>>> think it would be best to move forward with >>>>>>>>>>>> mainly >>>>>>>>>>>> option B. >>>>>>>>>>>> >>>>>>>>>>>> OK, I'll provide the sample on Monday for you. >>>>>>>>>>>> It is >>>>>>>>>>>> almost ready right now, but I need to >>>>>>>>>>>> double-check >>>>>>>>>>>> it before publishing. >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> Andrew. >>>>>>>>>>>> >>>>>>>>>>>> On 8/17/23 20:03, Adam Hassick wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>> >>>>>>>>>>>> I'm adding the CI mailing list to this >>>>>>>>>>>> conversation. Others in the community might find >>>>>>>>>>>> this conversation valuable. >>>>>>>>>>>> >>>>>>>>>>>> We do want to run testing on a regular basis. >>>>>>>>>>>> The >>>>>>>>>>>> Jenkins integration will be very useful for us, >>>>>>>>>>>> as >>>>>>>>>>>> most of our CI is orchestrated by Jenkins. >>>>>>>>>>>> I've conferred with the rest of the team, and we >>>>>>>>>>>> think it would be best to move forward with >>>>>>>>>>>> mainly >>>>>>>>>>>> option B. >>>>>>>>>>>> If you would like to know anything about our >>>>>>>>>>>> testbeds that would help you with creating an >>>>>>>>>>>> example ts-rigs repo, I'd be happy to answer any >>>>>>>>>>>> questions you have. >>>>>>>>>>>> >>>>>>>>>>>> We have multiple test rigs (we call these >>>>>>>>>>>> "DUT-tester pairs") that we run our existing >>>>>>>>>>>> hardware testing on, with differing network >>>>>>>>>>>> hardware and CPU architecture. I figured this >>>>>>>>>>>> might >>>>>>>>>>>> be an important detail. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Adam >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Aug 17, 2023 at 11:44 AM Andrew >>>>>>>>>>>> Rybchenko >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> > wrote: >>>>>>>>>>>> >>>>>>>>>>>> Greatings Adam, >>>>>>>>>>>> >>>>>>>>>>>> I'm happy to hear that you're trying to >>>>>>>>>>>> bring >>>>>>>>>>>> it up. >>>>>>>>>>>> >>>>>>>>>>>> As I understand the final goal is to run it >>>>>>>>>>>> on >>>>>>>>>>>> regular basis. So, we need to make it >>>>>>>>>>>> properly >>>>>>>>>>>> from the very beginning. >>>>>>>>>>>> Bring up of all features consists of 4 >>>>>>>>>>>> steps: >>>>>>>>>>>> >>>>>>>>>>>> 1. Create site-specific repository (we call >>>>>>>>>>>> it >>>>>>>>>>>> ts-rigs) which contains information about >>>>>>>>>>>> test >>>>>>>>>>>> rigs and other site-specific information >>>>>>>>>>>> like >>>>>>>>>>>> where to send mails, where to store logs >>>>>>>>>>>> etc. 
>>>>>>>>>>>> It is required for manual execution as well, >>>>>>>>>>>> since test rigs description is essential. >>>>>>>>>>>> I'll >>>>>>>>>>>> return to the topic below. >>>>>>>>>>>> >>>>>>>>>>>> 2. Setup logs storage for automated runs. >>>>>>>>>>>> Basically it is a disk space plus apache2 >>>>>>>>>>>> web >>>>>>>>>>>> server with few CGI scripts which help a >>>>>>>>>>>> lot to >>>>>>>>>>>> save disk space. >>>>>>>>>>>> >>>>>>>>>>>> 3. Setup Bublik web application which >>>>>>>>>>>> provides >>>>>>>>>>>> web interface to view testing results. Same >>>>>>>>>>>> as >>>>>>>>>>>> https://ts-factory.io/bublik >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 4. Setup Jenkins to run tests on regularly, >>>>>>>>>>>> save logs in log storage (2) and import it >>>>>>>>>>>> to >>>>>>>>>>>> bublik (3). >>>>>>>>>>>> >>>>>>>>>>>> Last few month we spent on our homework to >>>>>>>>>>>> make >>>>>>>>>>>> it simpler to bring up automated execution >>>>>>>>>>>> using Jenkins - >>>>>>>>>>>> https://github.com/ts-factory/te-jenkins >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Corresponding bits in dpdk-ethdev-ts will be >>>>>>>>>>>> available tomorrow. >>>>>>>>>>>> >>>>>>>>>>>> Let's return to the step (1). >>>>>>>>>>>> >>>>>>>>>>>> Unfortunately there is no publicly available >>>>>>>>>>>> example of the ts-rigs repository since >>>>>>>>>>>> sensitive site-specific information is >>>>>>>>>>>> located >>>>>>>>>>>> there. But I'm ready to help you to create >>>>>>>>>>>> it >>>>>>>>>>>> for UNH. I see two options here: >>>>>>>>>>>> >>>>>>>>>>>> (A) I'll ask questions and based on your >>>>>>>>>>>> answers will create the first draft with my >>>>>>>>>>>> comments. >>>>>>>>>>>> >>>>>>>>>>>> (B) I'll make a template/example ts-rigs >>>>>>>>>>>> repo, >>>>>>>>>>>> publish it and you'll create UNH ts-rigs >>>>>>>>>>>> based >>>>>>>>>>>> on it. >>>>>>>>>>>> >>>>>>>>>>>> Of course, I'll help to debug and finally >>>>>>>>>>>> bring >>>>>>>>>>>> it up in any case. >>>>>>>>>>>> >>>>>>>>>>>> (A) is a bit simpler for me and you, but >>>>>>>>>>>> (B) is >>>>>>>>>>>> a bit more generic and will help other >>>>>>>>>>>> potential users to bring it up. >>>>>>>>>>>> We can combine (A)+(B). I.e. start from (A). >>>>>>>>>>>> What do you think? >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Andrew. >>>>>>>>>>>> >>>>>>>>>>>> On 8/17/23 15:18, Konstantin Ushakov wrote: >>>>>>>>>>>> >>>>>>>>>>>> Greetings Adam, >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Thanks for contacting us. I copy Andrew who >>>>>>>>>>>> would be happy to help >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Konstantin >>>>>>>>>>>> >>>>>>>>>>>> On 16 Aug 2023, at 21:50, Adam Hassick >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>  >>>>>>>>>>>> Greetings Konstantin, >>>>>>>>>>>> >>>>>>>>>>>> I am in the process of setting up the DPDK >>>>>>>>>>>> Poll Mode Driver test suite as an addition >>>>>>>>>>>> to >>>>>>>>>>>> our testing coverage for DPDK at the UNH >>>>>>>>>>>> lab. >>>>>>>>>>>> >>>>>>>>>>>> I have some questions about how to set the >>>>>>>>>>>> test suite arguments. >>>>>>>>>>>> >>>>>>>>>>>> I have been able to configure the Test >>>>>>>>>>>> Engine >>>>>>>>>>>> to connect to the hosts in the testbed. The >>>>>>>>>>>> RCF, Configurator, and Tester all begin to >>>>>>>>>>>> run, however the prelude of the test suite >>>>>>>>>>>> fails to run. 
>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> https://ts-factory.io/doc/dpdk-ethdev-ts/index.html#test-parameters >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> The documentation mentions that there are >>>>>>>>>>>> several test parameters for the test suite, >>>>>>>>>>>> like for the IUT test link MAC, etc. These >>>>>>>>>>>> seem like they would need to be set >>>>>>>>>>>> somewhere >>>>>>>>>>>> to run many of the tests. >>>>>>>>>>>> >>>>>>>>>>>> I see in the Test Engine documentation, >>>>>>>>>>>> there >>>>>>>>>>>> are instructions on how to create new >>>>>>>>>>>> parameters for test suites in the Tester >>>>>>>>>>>> configuration, but there is nothing in the >>>>>>>>>>>> user guide or in the Tester guide for how to >>>>>>>>>>>> set the arguments for the parameters when >>>>>>>>>>>> running the test suite that I can find. I'm >>>>>>>>>>>> not sure if I need to write my own Tester >>>>>>>>>>>> config, or if I should be setting these in >>>>>>>>>>>> some other way. >>>>>>>>>>>> >>>>>>>>>>>> How should these values be set? >>>>>>>>>>>> >>>>>>>>>>>> I'm also not sure what environment >>>>>>>>>>>> variables/arguments are strictly necessary >>>>>>>>>>>> or >>>>>>>>>>>> which are optional. >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> Adam >>>>>>>>>>>> >>>>>>>>>>>> -- *Adam Hassick* >>>>>>>>>>>> Senior Developer >>>>>>>>>>>> UNH InterOperability Lab >>>>>>>>>>>> ahassick@iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> +1 (603) 475-8248 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- *Adam Hassick* >>>>>>>>>>>> Senior Developer >>>>>>>>>>>> UNH InterOperability Lab >>>>>>>>>>>> ahassick@iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> +1 (603) 475-8248 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- *Adam Hassick* >>>>>>>>>>>> Senior Developer >>>>>>>>>>>> UNH InterOperability Lab >>>>>>>>>>>> ahassick@iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> +1 (603) 475-8248 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- *Adam Hassick* >>>>>>>>>>>> Senior Developer >>>>>>>>>>>> UNH InterOperability Lab >>>>>>>>>>>> ahassick@iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> +1 (603) 475-8248 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- *Adam Hassick* >>>>>>>>>>>> Senior Developer >>>>>>>>>>>> UNH InterOperability Lab >>>>>>>>>>>> ahassick@iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> +1 (603) 475-8248 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> *Adam Hassick* >>>>>>>>>>>> Senior Developer >>>>>>>>>>>> UNH InterOperability Lab >>>>>>>>>>>> ahassick@iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> iol.unh.edu >>>>>>>>>>>> >>>>>>>>>>>> +1 (603) 475-8248 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> *Adam Hassick* >>>>>>>>>>> Senior Developer >>>>>>>>>>> UNH InterOperability Lab >>>>>>>>>>> ahassick@iol.unh.edu >>>>>>>>>>> iol.unh.edu >>>>>>>>>>> +1 (603) 475-8248 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> *Adam Hassick* >>>>>>>>>> Senior Developer >>>>>>>>>> UNH InterOperability Lab >>>>>>>>>> ahassick@iol.unh.edu >>>>>>>>>> iol.unh.edu >>>>>>>>>> +1 (603) 475-8248 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> *Adam Hassick* >>>>>>>>> Senior Developer >>>>>>>>> UNH InterOperability Lab >>>>>>>>> ahassick@iol.unh.edu >>>>>>>>> 
iol.unh.edu >>>>>>>>> +1 (603) 475-8248