Hi Andrew,

Two of our systems (the Test Engine runner and the DUT host) are running
Ubuntu 20.04 LTS. However, this morning I noticed that the tester system
(the one having issues) is running Ubuntu 22.04 LTS. This could be the
source of the problem. I encountered a dependency issue trying to run
the Test Engine on 22.04 LTS, so I downgraded the system. Since the
tester is also the host having connection issues, I will try downgrading
that system to 20.04 and see if that changes anything.

I did try passing the "--vg-rcf" argument to the run.sh script of the
test suite after installing valgrind, but I saw no additional output.

I will try pulling in the changes you've pushed up and will see if that
fixes anything.

Thanks,
Adam

On Fri, Aug 25, 2023 at 9:57 AM Andrew Rybchenko
<andrew.rybchenko@oktetlabs.ru> wrote:

> Hello Adam,
>
> On 8/24/23 23:54, Andrew Rybchenko wrote:
>
> > I'd like to try to repeat the problem locally. Which Linux distro is
> > running on test engine and agents?
> >
> > In fact I know one problem with Debian 12 and Fedora 38 and we have
> > a patch in review to fix it; however, the behaviour is different in
> > this case, so it is unlikely to be the same problem.
>
> I've just published a new tag which fixes known test engine side
> problems on Debian 12 and Fedora 38.
>
> > One more idea is to install valgrind on the test engine host and
> > run with option --vg-rcf to check if something weird is happening.
> >
> > What I don't understand right now is why I see just one failed
> > attempt to connect in your log.txt and then Logger shutdown after
> > 9 minutes.
> >
> > Andrew.
> >
> > On 8/24/23 23:29, Adam Hassick wrote:
> >
> > > Is there any firewall in the network or on test hosts which could
> > > block incoming TCP connection to the port 23571 from the host
> > > where you run test engine?
> >
> > Our test engine host and the testbed are on the same subnet. The
> > connection does work sometimes.
> > > If the behaviour is the same on the next try and you see that
> > > the test agent is kept running, could you check using
> > >
> > > # netstat -tnlp
> > >
> > > that Test Agent is listening on the port and try to establish a
> > > TCP connection from test agent using
> > >
> > > $ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571
> > >
> > > and check if the TCP connection could be established.
> >
> > I was able to replicate the same behavior again, where it hangs
> > while RCF is trying to start.
> > Running this command, I see this in the output:
> >
> > tcp  0  0  0.0.0.0:23571  0.0.0.0:*  LISTEN  18599/ta
> >
> > So it seems like it is listening on the correct port.
> > Additionally, I was able to connect to the Tester machine from our
> > Test Engine host using telnet. It printed the PID of the process
> > once the connection was opened.
> >
> > I tried running the "ta" application manually on the command line,
> > and it didn't print anything at all.
> > Maybe the issue is something on the Test Engine side.
> >
> > On Thu, Aug 24, 2023 at 2:35 PM Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru> wrote:
> >
> > Hi Adam,
> >
> > > On the tester host (which appears to be the Peer agent), there
> > > are four processes that I see running, which look like the test
> > > agent processes.
> >
> > Before the next try I'd recommend killing these processes.
> >
> > Is there any firewall in the network or on test hosts which could
> > block incoming TCP connection to the port 23571 from the host
> > where you run test engine?
> >
> > If the behaviour is the same on the next try and you see that the
> > test agent is kept running, could you check using
> >
> > # netstat -tnlp
> >
> > that Test Agent is listening on the port and try to establish a
> > TCP connection from test agent using
> >
> > $ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571
> >
> > and check if the TCP connection could be established.
> >
> > Another idea is to log in to the Tester as root, as testing does,
> > get the start TA command from the log and try it by hand without
> > -n and with the extra escaping removed.
> >
> > # sudo PATH=${PATH}:/tmp/linux_x86_root_76872_1692885663_1 \
> >   LD_LIBRARY_PATH=${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/tmp/linux_x86_root_76872_1692885663_1 \
> >   /tmp/linux_x86_root_76872_1692885663_1/ta Peer 23571 \
> >   host=iol-dts-tester.dpdklab.iol.unh.edu:port=23571:user=root:key=/opt/tsf/keys/id_ed25519:ssh_port=22:copy_timeout=15:kill_timeout=15:sudo=:shell=
> >
> > Hopefully in this case the test agent directory remains in /tmp
> > and you don't need to copy it as testing does.
> > Maybe the output could shed some light on what's going on.
> >
> > Andrew.
> >
> > On 8/24/23 17:30, Adam Hassick wrote:
> >
> > Hi Andrew,
> >
> > This is the output that I see in the terminal when this failure
> > occurs, after the test agent binaries build and the test engine
> > starts:
> >
> > Platform default build - pass
> > Simple RCF consistency check succeeded
> > --->>> Starting Logger...done
> > --->>> Starting RCF...rcf_net_engine_connect(): Connection timed out iol-dts-tester.dpdklab.iol.unh.edu:23571
> >
> > Then, it hangs here until I kill the "te_rcf" and "te_tee"
> > processes. I let it hang for around 9 minutes.
> >
> > On the tester host (which appears to be the Peer agent), there are
> > four processes that I see running, which look like the test agent
> > processes.
> >
> > ta.Peer is an empty file. I've attached the log.txt from this run.
> >
> > - Adam
> >
> > On Thu, Aug 24, 2023 at 4:22 AM Andrew Rybchenko wrote:
> >
> > Hi Adam,
> >
> > Yes, TE_RCFUNIX_TIMEOUT is in seconds. I've double-checked that it
> > goes to 'copy_timeout' in ts-conf/rcf.conf. The description in
> > doc/sphinx/pages/group_te_engine_rcf.rst says that copy_timeout is
> > in seconds, and the implementation in lib/rcfunix/rcfunix.c passes
> > the value to select() tv_sec. Theoretically select() could be
> > interrupted by a signal, but I think it is unlikely here.
> >
> > I'm not sure that I understand what you mean by RCF connection
> > timeout. Does it happen on TE startup when RCF starts test agents?
> > If so, TE_RCFUNIX_TIMEOUT could help.
> > Or does it happen when tests are in progress, e.g. in the middle
> > of a test? If so, TE_RCFUNIX_TIMEOUT is unrelated and most likely
> > either the host with the test agent dies or the test agent itself
> > crashes. It would be easier for me to classify it if you share the
> > text log (log.txt, full or just the corresponding fragment with
> > some context). Also the content of the ta.DPDK or ta.Peer file,
> > depending on which agent has problems, could shed some light.
> > The corresponding files contain stdout/stderr of test agents.
> >
> > Andrew.
> >
> > On 8/23/23 17:45, Adam Hassick wrote:
> >
> > Hi Andrew,
> >
> > I've set up a test rig repository here, and have created
> > configurations for our development testbed based off of the
> > examples.
> > We've been able to get the test suite to run manually on Mellanox
> > CX5 devices once.
> > However, we are running into an issue where, when RCF starts, the
> > RCF connection times out very frequently. We aren't sure why this
> > is the case.
> > It works sometimes, but most of the time when we try to run the
> > test engine, it encounters this issue.
> > I've tried changing the RCF port by setting "TE_RCF_PORT=" and
> > rebooting the testbed machines. Neither seems to fix the issue.
> >
> > It also seems like the timeout takes far longer than 60 seconds,
> > even when running "export TE_RCFUNIX_TIMEOUT=60" before I try to
> > run the test suite.
> > I assume the unit for this variable is seconds?
> >
> > Thanks,
> > Adam
> >
> > On Mon, Aug 21, 2023 at 10:19 AM Adam Hassick wrote:
> >
> > Hi Andrew,
> >
> > Thanks, I've cloned the example repository and will start setting
> > up a configuration for our development testbed today. I'll let you
> > know if I run into any difficulties or have any questions.
> >
> > - Adam
> >
> > On Sun, Aug 20, 2023 at 4:40 AM Andrew Rybchenko wrote:
> >
> > Hi Adam,
> >
> > I've published https://github.com/ts-factory/ts-rigs-sample.
> > Hopefully it will help to define your test rigs and successfully
> > run some tests manually.
> > Feel free to ask any questions and I'll answer here and try to
> > update the documentation.
> >
> > Meanwhile I'll prepare the missing bits for steps (2) and (3).
> > Hopefully everything is in place for step (4), but we need to make
> > steps (2) and (3) first.
> >
> > Andrew.
> >
> > On 8/18/23 21:40, Andrew Rybchenko wrote:
> >
> > Hi Adam,
> >
> > > I've conferred with the rest of the team, and we think it would
> > > be best to move forward with mainly option B.
> >
> > OK, I'll provide the sample on Monday for you. It is almost ready
> > right now, but I need to double-check it before publishing.
> >
> > Regards,
> > Andrew.
> >
> > On 8/17/23 20:03, Adam Hassick wrote:
> >
> > Hi Andrew,
> >
> > I'm adding the CI mailing list to this conversation. Others in the
> > community might find this conversation valuable.
> >
> > We do want to run testing on a regular basis. The Jenkins
> > integration will be very useful for us, as most of our CI is
> > orchestrated by Jenkins.
> > I've conferred with the rest of the team, and we think it would be
> > best to move forward with mainly option B.
> > If you would like to know anything about our testbeds that would
> > help you with creating an example ts-rigs repo, I'd be happy to
> > answer any questions you have.
> >
> > We have multiple test rigs (we call these "DUT-tester pairs") that
> > we run our existing hardware testing on, with differing network
> > hardware and CPU architecture. I figured this might be an
> > important detail.
> >
> > Thanks,
> > Adam
> >
> > On Thu, Aug 17, 2023 at 11:44 AM Andrew Rybchenko wrote:
> >
> > Greetings Adam,
> >
> > I'm happy to hear that you're trying to bring it up.
> >
> > As I understand, the final goal is to run it on a regular basis.
> > So, we need to make it properly from the very beginning.
> > Bring up of all features consists of 4 steps:
> >
> > 1. Create a site-specific repository (we call it ts-rigs) which
> >    contains information about test rigs and other site-specific
> >    information like where to send mails, where to store logs etc.
> >    It is required for manual execution as well, since the test
> >    rigs description is essential. I'll return to the topic below.
> >
> > 2. Set up log storage for automated runs. Basically it is disk
> >    space plus an apache2 web server with a few CGI scripts which
> >    help a lot to save disk space.
> >
> > 3. Set up the Bublik web application which provides a web
> >    interface to view testing results. Same as
> >    https://ts-factory.io/bublik
> >
> > 4. Set up Jenkins to run tests regularly, save logs in log
> >    storage (2) and import them to Bublik (3).
> >
> > We spent the last few months on our homework to make it simpler to
> > bring up automated execution using Jenkins:
> > https://github.com/ts-factory/te-jenkins
> > Corresponding bits in dpdk-ethdev-ts will be available tomorrow.
> >
> > Let's return to step (1).
> >
> > Unfortunately there is no publicly available example of the
> > ts-rigs repository since sensitive site-specific information is
> > located there. But I'm ready to help you to create it for UNH. I
> > see two options here:
> >
> > (A) I'll ask questions and based on your answers will create the
> >     first draft with my comments.
> >
> > (B) I'll make a template/example ts-rigs repo, publish it and
> >     you'll create UNH ts-rigs based on it.
> >
> > Of course, I'll help to debug and finally bring it up in any case.
> >
> > (A) is a bit simpler for me and you, but (B) is a bit more generic
> > and will help other potential users to bring it up.
> > We can combine (A)+(B), i.e. start from (A). What do you think?
> >
> > Thanks,
> > Andrew.
> >
> > On 8/17/23 15:18, Konstantin Ushakov wrote:
> >
> > Greetings Adam,
> >
> > Thanks for contacting us. I'm copying Andrew, who would be happy
> > to help.
> >
> > Thanks,
> > Konstantin
> >
> > On 16 Aug 2023, at 21:50, Adam Hassick wrote:
> >
> > Greetings Konstantin,
> >
> > I am in the process of setting up the DPDK Poll Mode Driver test
> > suite as an addition to our testing coverage for DPDK at the UNH
> > lab.
> >
> > I have some questions about how to set the test suite arguments.
> >
> > I have been able to configure the Test Engine to connect to the
> > hosts in the testbed. The RCF, Configurator, and Tester all begin
> > to run; however, the prelude of the test suite fails to run.
> >
> > https://ts-factory.io/doc/dpdk-ethdev-ts/index.html#test-parameters
> >
> > The documentation mentions that there are several test parameters
> > for the test suite, like for the IUT test link MAC, etc. These
> > seem like they would need to be set somewhere to run many of the
> > tests.
> >
> > I see in the Test Engine documentation, there are instructions on
> > how to create new parameters for test suites in the Tester
> > configuration, but I can find nothing in the user guide or in the
> > Tester guide on how to set the arguments for the parameters when
> > running the test suite. I'm not sure if I need to write my own
> > Tester config, or if I should be setting these in some other way.
> >
> > How should these values be set?
> >
> > I'm also not sure which environment variables/arguments are
> > strictly necessary and which are optional.
> >
> > Regards,
> > Adam
> >
> > --
> > *Adam Hassick*
> > Senior Developer
> > UNH InterOperability Lab
> > ahassick@iol.unh.edu
> > iol.unh.edu
> > +1 (603) 475-8248
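
Since the RCF connection failure discussed in this thread is intermittent, the manual telnet check against port 23571 could also be scripted and repeated from the Test Engine host. The following is only a minimal sketch in Python using the standard library; the function names `probe` and `probe_repeatedly` and the default attempt count, delay and timeout are hypothetical choices for illustration and are not part of Test Environment.

```python
import socket
import time
from typing import List


def probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def probe_repeatedly(host: str, port: int, attempts: int = 10,
                     delay: float = 1.0, timeout: float = 5.0) -> List[bool]:
    """Probe several times in a row to expose intermittent failures."""
    results = []
    for i in range(attempts):
        results.append(probe(host, port, timeout))
        if i + 1 < attempts:
            time.sleep(delay)  # pause between attempts, not after the last one
    return results
```

For example, calling `probe_repeatedly("iol-dts-tester.dpdklab.iol.unh.edu", 23571)` from the Test Engine host while the test agent is listening would return one boolean per attempt; a mix of True and False values would suggest a network-level problem rather than a Test Engine one.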