Hello Adam,

On 8/24/23 23:54, Andrew Rybchenko wrote:
> I'd like to try to repeat the problem locally. Which Linux distro is
> running on the test engine and agents?
>
> In fact, I know of one problem with Debian 12 and Fedora 38, and we have
> a patch in review to fix it. However, the behaviour is different in
> this case, so it is unlikely to be the same problem.

I've just published a new tag which fixes known test engine side
problems on Debian 12 and Fedora 38.

>
> One more idea is to install valgrind on the test engine host and
> run with the option --vg-rcf to check if something weird is happening.
>
> What I don't understand right now is why I see just one failed attempt
> to connect in your log.txt and then Logger shutdown after 9 minutes.
>
> Andrew.
>
> On 8/24/23 23:29, Adam Hassick wrote:
>>  > Is there any firewall in the network or on test hosts which could
>>  > block incoming TCP connections to port 23571 from the host where
>>  > you run the test engine?
>>
>> Our test engine host and the testbed are on the same subnet. The
>> connection does work sometimes.
>>
>>  > If the behaviour is the same on the next try and you see that the
>>  > test agent is kept running, could you check using
>>  >
>>  > # netstat -tnlp
>>  >
>>  > that the Test Agent is listening on the port, and try to establish a
>>  > TCP connection from test agent using
>>  >
>>  > $ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571
>>  >
>>  > and check if the TCP connection could be established.
>>
>> I was able to replicate the same behavior again, where it hangs while
>> RCF is trying to start.
>> Running this command, I see this in the output:
>>
>> tcp        0      0 0.0.0.0:23571            0.0.0.0:* LISTEN      18599/ta
>>
>> So it seems like it is listening on the correct port.
>> Additionally, I was able to connect to the Tester machine from our
>> Test Engine host using telnet. It printed the PID of the process once
>> the connection was opened.
>>
>> I tried running the "ta" application manually on the command line,
>> and it didn't print anything at all.
>> Maybe the issue is something on the Test Engine side.
>>
>> On Thu, Aug 24, 2023 at 2:35 PM Andrew Rybchenko wrote:
>>
>>     Hi Adam,
>>
>>      > On the tester host (which appears to be the Peer agent), there
>>      > are four processes that I see running, which look like the test
>>      > agent processes.
>>
>>     Before the next try I'd recommend killing these processes.
>>
>>     Is there any firewall in the network or on test hosts which could
>>     block incoming TCP connections to port 23571 from the host
>>     where you run the test engine?
>>
>>     If the behaviour is the same on the next try and you see that the
>>     test agent is kept running, could you check using
>>
>>     # netstat -tnlp
>>
>>     that the Test Agent is listening on the port, and try to establish a
>>     TCP connection from test agent using
>>
>>     $ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571
>>
>>     and check if the TCP connection could be established.
>>
>>     Another idea is to log in to the Tester as root, as testing does,
>>     get the TA start command from the log and try it by hand without
>>     -n and with the extra escaping removed.
>>
>>     # sudo PATH=${PATH}:/tmp/linux_x86_root_76872_1692885663_1
>>       LD_LIBRARY_PATH=${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/tmp/linux_x86_root_76872_1692885663_1
>>       /tmp/linux_x86_root_76872_1692885663_1/ta Peer 23571
>>       host=iol-dts-tester.dpdklab.iol.unh.edu:port=23571:user=root:key=/opt/tsf/keys/id_ed25519:ssh_port=22:copy_timeout=15:kill_timeout=15:sudo=:shell=
>>
>>     Hopefully in this case the test agent directory remains in /tmp and
>>     you don't need to copy it as testing does.
>>     Maybe the output could shed some light on what's going on.
>>
>>     Andrew.
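
The reachability check suggested above (telnet to the agent's port) can also be scripted, which helps when the failure is intermittent and the check has to be repeated. The following is only a minimal sketch using the Python standard library; the function name can_connect and the 5-second default timeout are illustrative assumptions and are not part of the Test Engine or its tooling.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper: roughly equivalent to checking whether
    `telnet host port` manages to connect.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For the setup discussed in this thread, the call would be can_connect("iol-dts-tester.dpdklab.iol.unh.edu", 23571), run from the test engine host. A True result only shows that something accepts connections on that port; it does not prove that the listener is a healthy test agent.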
>>
>>     On 8/24/23 17:30, Adam Hassick wrote:
>>>     Hi Andrew,
>>>
>>>     This is the output that I see in the terminal when this failure
>>>     occurs, after the test agent binaries build and the test engine
>>>     starts:
>>>
>>>     Platform default build - pass
>>>     Simple RCF consistency check succeeded
>>>     --->>> Starting Logger...done
>>>     --->>> Starting RCF...rcf_net_engine_connect(): Connection timed out iol-dts-tester.dpdklab.iol.unh.edu:23571
>>>
>>>     Then, it hangs here until I kill the "te_rcf" and "te_tee"
>>>     processes. I let it hang for around 9 minutes.
>>>
>>>     On the tester host (which appears to be the Peer agent), there are
>>>     four processes that I see running, which look like the test agent
>>>     processes.
>>>
>>>     ta.Peer is an empty file. I've attached the log.txt from this run.
>>>
>>>      - Adam
>>>
>>>     On Thu, Aug 24, 2023 at 4:22 AM Andrew Rybchenko wrote:
>>>
>>>         Hi Adam,
>>>
>>>         Yes, TE_RCFUNIX_TIMEOUT is in seconds. I've double-checked
>>>         that it goes to 'copy_timeout' in ts-conf/rcf.conf.
>>>         The description in doc/sphinx/pages/group_te_engine_rcf.rst
>>>         says that copy_timeout is in seconds, and the implementation in
>>>         lib/rcfunix/rcfunix.c passes the value to select() tv_sec.
>>>         Theoretically select() could be interrupted by a signal, but I
>>>         think it is unlikely here.
>>>
>>>         I'm not sure that I understand what you mean by RCF
>>>         connection timeout. Does it happen on TE startup when RCF
>>>         starts test agents? If so, TE_RCFUNIX_TIMEOUT could help. Or
>>>         does it happen when tests are in progress, e.g. in the middle
>>>         of a test? If so, TE_RCFUNIX_TIMEOUT is unrelated and most
>>>         likely either the host with the test agent dies or the test
>>>         agent itself crashes.
>>>         It would be easier for me to classify it if you share the
>>>         text log (log.txt, full or just the corresponding fragment with
>>>         some context). Also, the content of the ta.DPDK or ta.Peer file,
>>>         depending on which agent has problems, could shed some light.
>>>         These files contain the stdout/stderr of the test agents.
>>>
>>>         Andrew.
>>>
>>>         On 8/23/23 17:45, Adam Hassick wrote:
>>>>         Hi Andrew,
>>>>
>>>>         I've set up a test rig repository here, and have created
>>>>         configurations for our development testbed based off of the
>>>>         examples.
>>>>         We've been able to get the test suite to run manually on
>>>>         Mellanox CX5 devices once.
>>>>         However, we are running into an issue where, when RCF starts,
>>>>         the RCF connection times out very frequently. We aren't sure
>>>>         why this is the case.
>>>>         It works sometimes, but most of the time when we try to run
>>>>         the test engine, it encounters this issue.
>>>>         I've tried changing the RCF port by setting
>>>>         "TE_RCF_PORT=" and rebooting the testbed
>>>>         machines. Neither seems to fix the issue.
>>>>
>>>>         It also seems like the timeout takes far longer than 60
>>>>         seconds, even when running "export TE_RCFUNIX_TIMEOUT=60"
>>>>         before I try to run the test suite.
>>>>         I assume the unit for this variable is seconds?
>>>>
>>>>         Thanks,
>>>>         Adam
>>>>
>>>>         On Mon, Aug 21, 2023 at 10:19 AM Adam Hassick wrote:
>>>>
>>>>             Hi Andrew,
>>>>
>>>>             Thanks, I've cloned the example repository and will start
>>>>             setting up a configuration for our development testbed
>>>>             today. I'll let you know if I run into any difficulties
>>>>             or have any questions.
>>>>
>>>>              - Adam
>>>>
>>>>             On Sun, Aug 20, 2023 at 4:40 AM Andrew Rybchenko wrote:
>>>>
>>>>                 Hi Adam,
>>>>
>>>>                 I've published
>>>>                 https://github.com/ts-factory/ts-rigs-sample
>>>>                 Hopefully it will help to define your test rigs and
>>>>                 successfully run some tests manually. Feel free to
>>>>                 ask any questions and I'll answer here and try to
>>>>                 update documentation.
>>>>
>>>>                 Meanwhile I'll prepare missing bits for steps (2) and
>>>>                 (3).
>>>>                 Hopefully everything is in place for step (4), but we
>>>>                 need to make steps (2) and (3) first.
>>>>
>>>>                 Andrew.
>>>>
>>>>                 On 8/18/23 21:40, Andrew Rybchenko wrote:
>>>>>                 Hi Adam,
>>>>>
>>>>>                 > I've conferred with the rest of the team, and we
>>>>>                 > think it would be best to move forward with mainly
>>>>>                 > option B.
>>>>>
>>>>>                 OK, I'll provide the sample on Monday for you. It is
>>>>>                 almost ready right now, but I need to double-check
>>>>>                 it before publishing.
>>>>>
>>>>>                 Regards,
>>>>>                 Andrew.
>>>>>
>>>>>                 On 8/17/23 20:03, Adam Hassick wrote:
>>>>>>                 Hi Andrew,
>>>>>>
>>>>>>                 I'm adding the CI mailing list to this
>>>>>>                 conversation. Others in the community might find
>>>>>>                 this conversation valuable.
>>>>>>
>>>>>>                 We do want to run testing on a regular basis. The
>>>>>>                 Jenkins integration will be very useful for us, as
>>>>>>                 most of our CI is orchestrated by Jenkins.
>>>>>>                 I've conferred with the rest of the team, and we
>>>>>>                 think it would be best to move forward with mainly
>>>>>>                 option B.
>>>>>>                 If you would like to know anything about our
>>>>>>                 testbeds that would help you with creating an
>>>>>>                 example ts-rigs repo, I'd be happy to answer any
>>>>>>                 questions you have.
>>>>>>
>>>>>>                 We have multiple test rigs (we call these
>>>>>>                 "DUT-tester pairs") that we run our existing
>>>>>>                 hardware testing on, with differing network
>>>>>>                 hardware and CPU architecture. I figured this might
>>>>>>                 be an important detail.
>>>>>>
>>>>>>                 Thanks,
>>>>>>                 Adam
>>>>>>
>>>>>>                 On Thu, Aug 17, 2023 at 11:44 AM Andrew Rybchenko wrote:
>>>>>>
>>>>>>                     Greetings Adam,
>>>>>>
>>>>>>                     I'm happy to hear that you're trying to bring
>>>>>>                     it up.
>>>>>>
>>>>>>                     As I understand, the final goal is to run it on
>>>>>>                     a regular basis. So, we need to set it up
>>>>>>                     properly from the very beginning.
>>>>>>                     Bringing up all features consists of 4 steps:
>>>>>>
>>>>>>                     1. Create a site-specific repository (we call it
>>>>>>                     ts-rigs) which contains information about test
>>>>>>                     rigs and other site-specific information like
>>>>>>                     where to send mails, where to store logs etc.
>>>>>>                     It is required for manual execution as well,
>>>>>>                     since the test rigs description is essential.
>>>>>>                     I'll return to the topic below.
>>>>>>
>>>>>>                     2. Set up logs storage for automated runs.
>>>>>>                     Basically it is disk space plus an apache2 web
>>>>>>                     server with a few CGI scripts which help a lot
>>>>>>                     to save disk space.
>>>>>>
>>>>>>                     3. Set up the Bublik web application which
>>>>>>                     provides a web interface to view testing
>>>>>>                     results. Same as
>>>>>>                     https://ts-factory.io/bublik
>>>>>>
>>>>>>                     4. Set up Jenkins to run tests regularly,
>>>>>>                     save logs in log storage (2) and import them
>>>>>>                     to Bublik (3).
>>>>>>
>>>>>>                     We spent the last few months on our homework to
>>>>>>                     make it simpler to bring up automated execution
>>>>>>                     using Jenkins -
>>>>>>                     https://github.com/ts-factory/te-jenkins
>>>>>>
>>>>>>                     Corresponding bits in dpdk-ethdev-ts will be
>>>>>>                     available tomorrow.
>>>>>>
>>>>>>                     Let's return to step (1).
>>>>>>
>>>>>>                     Unfortunately there is no publicly available
>>>>>>                     example of the ts-rigs repository since
>>>>>>                     sensitive site-specific information is located
>>>>>>                     there. But I'm ready to help you to create it
>>>>>>                     for UNH. I see two options here:
>>>>>>
>>>>>>                     (A) I'll ask questions and, based on your
>>>>>>                     answers, will create the first draft with my
>>>>>>                     comments.
>>>>>>
>>>>>>                     (B) I'll make a template/example ts-rigs repo,
>>>>>>                     publish it and you'll create UNH ts-rigs based
>>>>>>                     on it.
>>>>>>
>>>>>>                     Of course, I'll help to debug and finally bring
>>>>>>                     it up in any case.
>>>>>>
>>>>>>                     (A) is a bit simpler for me and you, but (B) is
>>>>>>                     a bit more generic and will help other
>>>>>>                     potential users to bring it up.
>>>>>>                     We can combine (A)+(B). I.e. start from (A).
>>>>>>                     What do you think?
>>>>>>
>>>>>>                     Thanks,
>>>>>>                     Andrew.
>>>>>>
>>>>>>                     On 8/17/23 15:18, Konstantin Ushakov wrote:
>>>>>>>                     Greetings Adam,
>>>>>>>
>>>>>>>                     Thanks for contacting us. I'm copying Andrew,
>>>>>>>                     who would be happy to help.
>>>>>>>
>>>>>>>                     Thanks,
>>>>>>>                     Konstantin
>>>>>>>
>>>>>>>>                     On 16 Aug 2023, at 21:50, Adam Hassick wrote:
>>>>>>>>
>>>>>>>>                     Greetings Konstantin,
>>>>>>>>
>>>>>>>>                     I am in the process of setting up the DPDK
>>>>>>>>                     Poll Mode Driver test suite as an addition to
>>>>>>>>                     our testing coverage for DPDK at the UNH lab.
>>>>>>>>
>>>>>>>>                     I have some questions about how to set the
>>>>>>>>                     test suite arguments.
>>>>>>>>
>>>>>>>>                     I have been able to configure the Test Engine
>>>>>>>>                     to connect to the hosts in the testbed. The
>>>>>>>>                     RCF, Configurator, and Tester all begin to
>>>>>>>>                     run, however the prelude of the test suite
>>>>>>>>                     fails to run.
>>>>>>>>
>>>>>>>>                     https://ts-factory.io/doc/dpdk-ethdev-ts/index.html#test-parameters
>>>>>>>>
>>>>>>>>                     The documentation mentions that there are
>>>>>>>>                     several test parameters for the test suite,
>>>>>>>>                     like for the IUT test link MAC, etc.
>>>>>>>>                     These seem like they would need to be set
>>>>>>>>                     somewhere to run many of the tests.
>>>>>>>>
>>>>>>>>                     I see in the Test Engine documentation, there
>>>>>>>>                     are instructions on how to create new
>>>>>>>>                     parameters for test suites in the Tester
>>>>>>>>                     configuration, but there is nothing in the
>>>>>>>>                     user guide or in the Tester guide for how to
>>>>>>>>                     set the arguments for the parameters when
>>>>>>>>                     running the test suite that I can find. I'm
>>>>>>>>                     not sure if I need to write my own Tester
>>>>>>>>                     config, or if I should be setting these in
>>>>>>>>                     some other way.
>>>>>>>>
>>>>>>>>                     How should these values be set?
>>>>>>>>
>>>>>>>>                     I'm also not sure what environment
>>>>>>>>                     variables/arguments are strictly necessary or
>>>>>>>>                     which are optional.
>>>>>>>>
>>>>>>>>                     Regards,
>>>>>>>>                     Adam
>>>>>>>>
>>>>>>>>                     --
>>>>>>>>                     *Adam Hassick*
>>>>>>>>                     Senior Developer
>>>>>>>>                     UNH InterOperability Lab
>>>>>>>>                     ahassick@iol.unh.edu
>>>>>>>>                     iol.unh.edu
>>>>>>>>                     +1 (603) 475-8248