On 8/25/23 17:06, Adam Hassick wrote:
> Hi Andrew,
>
> Two of our systems (the Test Engine runner and the DUT host) are
> running Ubuntu 20.04 LTS, however this morning I noticed that the
> tester system (the one having issues) is running Ubuntu 22.04 LTS.
> This could be the source of the problem. I encountered a dependency
> issue trying to run the Test Engine on 22.04 LTS, so I downgraded the
> system. Since the tester is also the host having connection issues, I
> will try downgrading that system to 20.04, and see if that changes
> anything.

Unlikely, but who knows. We run tests (DUTs) on Ubuntu 20.04,
Ubuntu 22.04, Ubuntu 22.10, Ubuntu 23.04, Debian 11 and Fedora 38
every night. Right now Debian 11 is used for the test engine in
nightly regressions.

I'll double-check the test engine on Ubuntu 20.04 and Ubuntu 22.04.

> I did try passing in the "--vg-rcf" argument to the run.sh script of
> the test suite after installing valgrind, but there was no additional
> output that I saw.

Sorry, I should have mentioned that the valgrind output should be in
valgrind.te_rcf (in the directory where you run the test engine).

>
> I will try pulling in the changes you've pushed up, and will see if
> that fixes anything.
>
> Thanks,
> Adam
>
> On Fri, Aug 25, 2023 at 9:57 AM Andrew Rybchenko wrote:
>
> Hello Adam,
>
> On 8/24/23 23:54, Andrew Rybchenko wrote:
>> I'd like to try to repeat the problem locally. Which Linux distro
>> is running on test engine and agents?
>>
>> In fact I know one problem with Debian 12 and Fedora 38 and we have
>> a patch in review to fix it, however, the behaviour is different in
>> this case, so it is unlikely to be the same problem.
>
> I've just published a new tag which fixes known test engine side
> problems on Debian 12 and Fedora 38.
>
>>
>> One more idea is to install valgrind on the test engine host and
>> run with option --vg-rcf to check if something weird is happening.
>>
>> What I don't understand right now is why I see just one failed
>> attempt to connect in your log.txt and then Logger shutdown after
>> 9 minutes.
>>
>> Andrew.
>>
>> On 8/24/23 23:29, Adam Hassick wrote:
>>> > Is there any firewall in the network or on test hosts which
>>> > could block incoming TCP connection to the port 23571 from the
>>> > host where you run test engine?
>>>
>>> Our test engine host and the testbed are on the same subnet. The
>>> connection does work sometimes.
>>>
>>> > If behaviour the same on the next try and you see that test
>>> > agent is kept running, could you check using
>>> >
>>> > # netstat -tnlp
>>> >
>>> > that Test Agent is listening on the port and try to establish
>>> > TCP connection from test agent using
>>> >
>>> > $ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571
>>> >
>>> > and check if TCP connection could be established.
>>>
>>> I was able to replicate the same behavior again, where it hangs
>>> while RCF is trying to start.
>>> Running this command, I see this in the output:
>>>
>>> tcp        0      0 0.0.0.0:23571      0.0.0.0:*      LISTEN      18599/ta
>>>
>>> So it seems like it is listening on the correct port.
>>> Additionally, I was able to connect to the Tester machine from
>>> our Test Engine host using telnet. It printed the PID of the
>>> process once the connection was opened.
>>>
>>> I tried running the "ta" application manually on the command
>>> line, and it didn't print anything at all.
>>> Maybe the issue is something on the Test Engine side.
>>>
>>> On Thu, Aug 24, 2023 at 2:35 PM Andrew Rybchenko wrote:
>>>
>>>     Hi Adam,
>>>
>>>      > On the tester host (which appears to be the Peer agent),
>>>      > there are four processes that I see running, which look
>>>      > like the test agent processes.
>>>
>>>     Before the next try I'd recommend to kill these processes.
>>>
>>>     Is there any firewall in the network or on test hosts which could
>>>     block incoming TCP connection to the port 23571 from the host
>>>     where you run test engine?
>>>
>>>     If behaviour the same on the next try and you see that test agent is
>>>     kept running, could you check using
>>>
>>>     # netstat -tnlp
>>>
>>>     that Test Agent is listening on the port and try to establish TCP
>>>     connection from test agent using
>>>
>>>     $ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571
>>>
>>>     and check if TCP connection could be established.
>>>
>>>     Another idea is to login Tester under root as testing does, get
>>>     start TA command from the log and try it by hands without -n and
>>>     remove extra escaping.
>>>
>>>     # sudo PATH=${PATH}:/tmp/linux_x86_root_76872_1692885663_1
>>> LD_LIBRARY_PATH=${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/tmp/linux_x86_root_76872_1692885663_1
>>> /tmp/linux_x86_root_76872_1692885663_1/ta Peer 23571
>>> host=iol-dts-tester.dpdklab.iol.unh.edu:port=23571:user=root:key=/opt/tsf/keys/id_ed25519:ssh_port=22:copy_timeout=15:kill_timeout=15:sudo=:shell=
>>>
>>>     Hopefully in this case test agent directory remains in the /tmp and
>>>     you don't need to copy it as testing does.
>>>     May be output could shed some light on what's going on.
>>>
>>>     Andrew.
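
Since the connection is reported to work only sometimes, the manual
netstat/telnet check quoted above could also be scripted and repeated.
What follows is only a minimal sketch and not part of the Test
Environment: the helper names probe_tcp and probe_repeated, the
5-second timeout, the attempt count and the delay are assumptions made
for illustration. Only the host name and port 23571 come from this
thread.

```python
import socket
import sys
import time


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def probe_repeated(host: str, port: int, attempts: int = 10,
                   delay: float = 1.0) -> list:
    """Probe several times to expose intermittent failures."""
    results = []
    for _ in range(attempts):
        results.append(probe_tcp(host, port))
        time.sleep(delay)
    return results


if __name__ == "__main__":
    # Hypothetical usage: python3 probe.py <host> <port>
    if len(sys.argv) != 3:
        sys.exit("usage: probe.py <host> <port>")
    outcome = probe_repeated(sys.argv[1], int(sys.argv[2]))
    print(f"{sum(outcome)}/{len(outcome)} connection attempts succeeded")
```

Run from the test engine host against
iol-dts-tester.dpdklab.iol.unh.edu and port 23571 while the test agent
is listening; a mix of successes and failures would point at the
network path rather than at the agent itself.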
>>>
>>>     On 8/24/23 17:30, Adam Hassick wrote:
>>>>     Hi Andrew,
>>>>
>>>>     This is the output that I see in the terminal when this failure
>>>>     occurs, after the test agent binaries build and the test engine
>>>>     starts:
>>>>
>>>>     Platform default build - pass
>>>>     Simple RCF consistency check succeeded
>>>>     --->>> Starting Logger...done
>>>>     --->>> Starting RCF...rcf_net_engine_connect(): Connection timed
>>>>     out iol-dts-tester.dpdklab.iol.unh.edu:23571
>>>>
>>>>     Then, it hangs here until I kill the "te_rcf" and "te_tee"
>>>>     processes. I let it hang for around 9 minutes.
>>>>
>>>>     On the tester host (which appears to be the Peer agent), there are
>>>>     four processes that I see running, which look like the test agent
>>>>     processes.
>>>>
>>>>     ta.Peer is an empty file. I've attached the log.txt from this run.
>>>>
>>>>      - Adam
>>>>
>>>>     On Thu, Aug 24, 2023 at 4:22 AM Andrew Rybchenko wrote:
>>>>
>>>>         Hi Adam,
>>>>
>>>>         Yes, TE_RCFUNIX_TIMEOUT is in seconds. I've double-checked
>>>>         that it goes to 'copy_timeout' in ts-conf/rcf.conf.
>>>>         Description in doc/sphinx/pages/group_te_engine_rcf.rst
>>>>         says that copy_timeout is in seconds and implementation in
>>>>         lib/rcfunix/rcfunix.c passes the value to select() tv_sec.
>>>>         Theoretically select() could be interrupted by signal, but I
>>>>         think it is unlikely here.
>>>>
>>>>         I'm not sure that I understand what you mean by RCF
>>>>         connection timeout. Does it happen on TE startup when RCF
>>>>         starts test agents? If so, TE_RCFUNIX_TIMEOUT could help. Or
>>>>         does it happen when tests are in progress, e.g. in the middle
>>>>         of a test?
>>>>         If so, TE_RCFUNIX_TIMEOUT is unrelated and most
>>>>         likely either host with test agent dies or test agent itself
>>>>         crashes. It would be easier for me to classify it if you share
>>>>         text log (log.txt, full or just corresponding fragment with
>>>>         some context). Also content of ta.DPDK or ta.Peer file
>>>>         depending on which agent has problems could shed some light.
>>>>         Corresponding files contain stdout/stderr of test agents.
>>>>
>>>>         Andrew.
>>>>
>>>>         On 8/23/23 17:45, Adam Hassick wrote:
>>>>>         Hi Andrew,
>>>>>
>>>>>         I've set up a test rig repository here, and have created
>>>>>         configurations for our development testbed based off of the
>>>>>         examples.
>>>>>         We've been able to get the test suite to run manually on
>>>>>         Mellanox CX5 devices once.
>>>>>         However, we are running into an issue where, when RCF starts,
>>>>>         the RCF connection times out very frequently. We aren't sure
>>>>>         why this is the case.
>>>>>         It works sometimes, but most of the time when we try to run
>>>>>         the test engine, it encounters this issue.
>>>>>         I've tried changing the RCF port by setting
>>>>>         "TE_RCF_PORT=" and rebooting the testbed
>>>>>         machines. Neither seems to fix the issue.
>>>>>
>>>>>         It also seems like the timeout takes far longer than 60
>>>>>         seconds, even when running "export TE_RCFUNIX_TIMEOUT=60"
>>>>>         before I try to run the test suite.
>>>>>         I assume the unit for this variable is seconds?
>>>>>
>>>>>         Thanks,
>>>>>         Adam
>>>>>
>>>>>         On Mon, Aug 21, 2023 at 10:19 AM Adam Hassick wrote:
>>>>>
>>>>>             Hi Andrew,
>>>>>
>>>>>             Thanks, I've cloned the example repository and will start
>>>>>             setting up a configuration for our development testbed
>>>>>             today. I'll let you know if I run into any difficulties
>>>>>             or have any questions.
>>>>>
>>>>>              - Adam
>>>>>
>>>>>             On Sun, Aug 20, 2023 at 4:40 AM Andrew Rybchenko wrote:
>>>>>
>>>>>                 Hi Adam,
>>>>>
>>>>>                 I've published
>>>>>                 https://github.com/ts-factory/ts-rigs-sample.
>>>>>                 Hopefully it will help to define your test rigs and
>>>>>                 successfully run some tests manually. Feel free to
>>>>>                 ask any questions and I'll answer here and try to
>>>>>                 update documentation.
>>>>>
>>>>>                 Meanwhile I'll prepare missing bits for steps (2) and
>>>>>                 (3).
>>>>>                 Hopefully everything is in place for step (4), but we
>>>>>                 need to make steps (2) and (3) first.
>>>>>
>>>>>                 Andrew.
>>>>>
>>>>>                 On 8/18/23 21:40, Andrew Rybchenko wrote:
>>>>>>                 Hi Adam,
>>>>>>
>>>>>>                 > I've conferred with the rest of the team, and we
>>>>>>                 > think it would be best to move forward with mainly
>>>>>>                 > option B.
>>>>>>
>>>>>>                 OK, I'll provide the sample on Monday for you. It is
>>>>>>                 almost ready right now, but I need to double-check
>>>>>>                 it before publishing.
>>>>>>
>>>>>>                 Regards,
>>>>>>                 Andrew.
>>>>>>
>>>>>>                 On 8/17/23 20:03, Adam Hassick wrote:
>>>>>>>                 Hi Andrew,
>>>>>>>
>>>>>>>                 I'm adding the CI mailing list to this
>>>>>>>                 conversation. Others in the community might find
>>>>>>>                 this conversation valuable.
>>>>>>>
>>>>>>>                 We do want to run testing on a regular basis. The
>>>>>>>                 Jenkins integration will be very useful for us, as
>>>>>>>                 most of our CI is orchestrated by Jenkins.
>>>>>>>                 I've conferred with the rest of the team, and we
>>>>>>>                 think it would be best to move forward with mainly
>>>>>>>                 option B.
>>>>>>>                 If you would like to know anything about our
>>>>>>>                 testbeds that would help you with creating an
>>>>>>>                 example ts-rigs repo, I'd be happy to answer any
>>>>>>>                 questions you have.
>>>>>>>
>>>>>>>                 We have multiple test rigs (we call these
>>>>>>>                 "DUT-tester pairs") that we run our existing
>>>>>>>                 hardware testing on, with differing network
>>>>>>>                 hardware and CPU architecture. I figured this might
>>>>>>>                 be an important detail.
>>>>>>>
>>>>>>>                 Thanks,
>>>>>>>                 Adam
>>>>>>>
>>>>>>>                 On Thu, Aug 17, 2023 at 11:44 AM Andrew Rybchenko wrote:
>>>>>>>
>>>>>>>                     Greetings Adam,
>>>>>>>
>>>>>>>                     I'm happy to hear that you're trying to bring
>>>>>>>                     it up.
>>>>>>>
>>>>>>>                     As I understand the final goal is to run it on
>>>>>>>                     a regular basis. So, we need to make it properly
>>>>>>>                     from the very beginning.
>>>>>>>                     Bring up of all features consists of 4 steps:
>>>>>>>
>>>>>>>                     1. Create site-specific repository (we call it
>>>>>>>                     ts-rigs) which contains information about test
>>>>>>>                     rigs and other site-specific information like
>>>>>>>                     where to send mails, where to store logs etc.
>>>>>>>                     It is required for manual execution as well,
>>>>>>>                     since test rigs description is essential. I'll
>>>>>>>                     return to the topic below.
>>>>>>>
>>>>>>>                     2. Setup logs storage for automated runs.
>>>>>>>                     Basically it is a disk space plus apache2 web
>>>>>>>                     server with a few CGI scripts which help a lot to
>>>>>>>                     save disk space.
>>>>>>>
>>>>>>>                     3. Setup Bublik web application which provides
>>>>>>>                     web interface to view testing results. Same as
>>>>>>>                     https://ts-factory.io/bublik
>>>>>>>
>>>>>>>                     4. Setup Jenkins to run tests regularly,
>>>>>>>                     save logs in log storage (2) and import them to
>>>>>>>                     bublik (3).
>>>>>>>
>>>>>>>                     We spent the last few months on our homework to
>>>>>>>                     make it simpler to bring up automated execution
>>>>>>>                     using Jenkins -
>>>>>>>                     https://github.com/ts-factory/te-jenkins
>>>>>>>
>>>>>>>                     Corresponding bits in dpdk-ethdev-ts will be
>>>>>>>                     available tomorrow.
>>>>>>>
>>>>>>>                     Let's return to the step (1).
>>>>>>>
>>>>>>>                     Unfortunately there is no publicly available
>>>>>>>                     example of the ts-rigs repository since
>>>>>>>                     sensitive site-specific information is located
>>>>>>>                     there. But I'm ready to help you to create it
>>>>>>>                     for UNH. I see two options here:
>>>>>>>
>>>>>>>                     (A) I'll ask questions and based on your
>>>>>>>                     answers will create the first draft with my
>>>>>>>                     comments.
>>>>>>>
>>>>>>>                     (B) I'll make a template/example ts-rigs repo,
>>>>>>>                     publish it and you'll create UNH ts-rigs based
>>>>>>>                     on it.
>>>>>>>
>>>>>>>                     Of course, I'll help to debug and finally bring
>>>>>>>                     it up in any case.
>>>>>>>
>>>>>>>                     (A) is a bit simpler for me and you, but (B) is
>>>>>>>                     a bit more generic and will help other
>>>>>>>                     potential users to bring it up.
>>>>>>>                     We can combine (A)+(B). I.e. start from (A).
>>>>>>>                     What do you think?
>>>>>>>
>>>>>>>                     Thanks,
>>>>>>>                     Andrew.
>>>>>>>
>>>>>>>                     On 8/17/23 15:18, Konstantin Ushakov wrote:
>>>>>>>> Greetings Adam,
>>>>>>>>
>>>>>>>>                     Thanks for contacting us.
>>>>>>>>                     I copy Andrew who would be happy to help
>>>>>>>>
>>>>>>>>                     Thanks,
>>>>>>>>                     Konstantin
>>>>>>>>
>>>>>>>>>                     On 16 Aug 2023, at 21:50, Adam Hassick wrote:
>>>>>>>>>
>>>>>>>>>                     Greetings Konstantin,
>>>>>>>>>
>>>>>>>>>                     I am in the process of setting up the DPDK
>>>>>>>>>                     Poll Mode Driver test suite as an addition to
>>>>>>>>>                     our testing coverage for DPDK at the UNH lab.
>>>>>>>>>
>>>>>>>>>                     I have some questions about how to set the
>>>>>>>>>                     test suite arguments.
>>>>>>>>>
>>>>>>>>>                     I have been able to configure the Test Engine
>>>>>>>>>                     to connect to the hosts in the testbed. The
>>>>>>>>>                     RCF, Configurator, and Tester all begin to
>>>>>>>>>                     run, however the prelude of the test suite
>>>>>>>>>                     fails to run.
>>>>>>>>>
>>>>>>>>>                     https://ts-factory.io/doc/dpdk-ethdev-ts/index.html#test-parameters
>>>>>>>>>
>>>>>>>>>                     The documentation mentions that there are
>>>>>>>>>                     several test parameters for the test suite,
>>>>>>>>>                     like for the IUT test link MAC, etc. These
>>>>>>>>>                     seem like they would need to be set somewhere
>>>>>>>>>                     to run many of the tests.
>>>>>>>>>
>>>>>>>>>                     I see in the Test Engine documentation, there
>>>>>>>>>                     are instructions on how to create new
>>>>>>>>>                     parameters for test suites in the Tester
>>>>>>>>>                     configuration, but there is nothing in the
>>>>>>>>>                     user guide or in the Tester guide for how to
>>>>>>>>>                     set the arguments for the parameters when
>>>>>>>>>                     running the test suite that I can find. I'm
>>>>>>>>>                     not sure if I need to write my own Tester
>>>>>>>>>                     config, or if I should be setting these in
>>>>>>>>>                     some other way.
>>>>>>>>>
>>>>>>>>>                     How should these values be set?
>>>>>>>>>
>>>>>>>>>                     I'm also not sure what environment
>>>>>>>>>                     variables/arguments are strictly necessary or
>>>>>>>>>                     which are optional.
>>>>>>>>>
>>>>>>>>>                     Regards,
>>>>>>>>>                     Adam
>>>>>>>>>
>>>>>>>>>                     --
>>>>>>>>>                     *Adam Hassick*
>>>>>>>>>                     Senior Developer
>>>>>>>>>                     UNH InterOperability Lab
>>>>>>>>>                     ahassick@iol.unh.edu
>>>>>>>>>                     iol.unh.edu
>>>>>>>>>                     +1 (603) 475-8248