From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <74cec43e-43d5-eb4c-caa2-8ebada2680c1@oktetlabs.ru>
Date: Tue, 29 Aug 2023 00:05:38 +0300
Subject: Re: Setting up DPDK PMD Test Suite
From: Andrew Rybchenko
To: Adam Hassick
Cc: Patrick Robb, Konstantin Ushakov, ci@dpdk.org
List-Id: DPDK CI discussions

Hi Adam,

> Does the test engine prefer to use IPv6 over IPv4 for initiating the RCF connection to the test bed hosts? And if so, is there a way to force it to use IPv4?

Brilliant idea. If DNS returns both IPv4 and IPv6 addresses in your case, I guess it is the root cause of the problem.
Of course, it is a TE problem, since I see really weird code in lib/comm_net_engine/comm_net_engine.c at line 135.
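
For context: a resolver can return both IPv6 and IPv4 addresses for the same host name, and a client that tries only the first one may time out even though another address is reachable. The sketch below is not the TE code from comm_net_engine.c. It is a minimal Python illustration, with the hypothetical helper names `pick_addresses` and `connect_any`, of trying every address returned by `getaddrinfo()` and optionally restricting the lookup to IPv4.

```python
import socket


def pick_addresses(host, port, ipv4_only=False):
    """Return (family, sockaddr) pairs for host:port, in resolver order."""
    family = socket.AF_INET if ipv4_only else socket.AF_UNSPEC
    infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    return [(fam, sockaddr) for fam, _type, _proto, _canon, sockaddr in infos]


def connect_any(host, port, timeout=5.0, ipv4_only=False):
    """Try each resolved address in turn; return the first connected socket."""
    last_error = None
    for fam, sockaddr in pick_addresses(host, port, ipv4_only):
        try:
            sock = socket.socket(fam, socket.SOCK_STREAM)
        except OSError as exc:
            # The address family may be unavailable on this host.
            last_error = exc
            continue
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            # Remember the failure and fall through to the next address.
            last_error = exc
            sock.close()
    raise last_error or OSError("no usable address for %s" % host)
```

Calling `connect_any(host, 23571, ipv4_only=True)` would approximate "force IPv4", while the default tries every family the resolver returns.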

I've pushed a fix to the branch user/arybchik/fix_ipv4_only in the ts-factory/test-environment repository. Please try it.

It is a late-night fix with minimal testing and no review. I'll pass it through the review process tomorrow, and
hopefully it will be released in one or two days.

Andrew.

On 8/28/23 18:02, Adam Hassick wrote:
Hi Andrew,

We have yet to notice a distinct pattern with the failures. Sometimes, the RCF will start and connect without issue a few times in a row before failing to connect again. Once the issue begins to occur, neither rebooting all of the hosts (test engine VM, tester, IUT) nor deleting all of the build directories (suites, agents, inst) and rebooting the hosts afterward resolves the issue. When it begins working again seems very arbitrary to us.

I do usually try to terminate the test engine with Ctrl+C, but when it hangs while trying to start RCF, that does not work.

Does the test engine prefer to use IPv6 over IPv4 for initiating the RCF connection to the test bed hosts? And if so, is there a way to force it to use IPv4?

 - Adam

On Fri, Aug 25, 2023 at 1:35 PM Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> wrote:
> I'll double-check test engine on Ubuntu 20.04 and Ubuntu 22.04.

Done. It works fine for me without any issues.

Have you noticed any pattern when it works or does not work?
Maybe it is a problem of unclean state after termination?
Does it work fine the first time after DUTs reboot?
How do you terminate testing? It should be done using Ctrl+C in the terminal where you execute the run.sh command.
In this case it should shut down gracefully and close all test agents and engine applications.

(I'm trying to understand why you've seen many test agent processes. It should not happen.)

Andrew.

On 8/25/23 17:41, Andrew Rybchenko wrote:
On 8/25/23 17:06, Adam Hassick wrote:
Hi Andrew,

Two of our systems (the Test Engine runner and the DUT host) are running Ubuntu 20.04 LTS, however this morning I noticed that the tester system (the one having issues) is running Ubuntu 22.04 LTS.
This could be the source of the problem. I encountered a dependency issue trying to run the Test Engine on 22.04 LTS, so I downgraded the system. Since the tester is also the host having connection issues, I will try downgrading that system to 20.04, and see if that changes anything.

Unlikely, but who knows. We run tests (DUTs) on Ubuntu 20.04, Ubuntu 22.04, Ubuntu 22.10, Ubuntu 23.04, Debian 11 and Fedora 38 every night.
Right now Debian 11 is used for test engine in nightly regressions.

I'll double-check test engine on Ubuntu 20.04 and Ubuntu 22.04.

I did try passing in the "--vg-rcf" argument to the run.sh script of the test suite after installing valgrind, but there was no additional output that I saw.

Sorry, I should have said that the valgrind output should be in valgrind.te_rcf (in the directory where you run the test engine).


I will try pulling in the changes you've pushed up, and will see if that fixes anything.

Thanks,
Adam

On Fri, Aug 25, 2023 at 9:57 AM Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> wrote:
Hello Adam,

On 8/24/23 23:54, Andrew Rybchenko wrote:
I'd like to try to repeat the problem locally. Which Linux distro is running on test engine and agents?

In fact, I know of one problem with Debian 12 and Fedora 38, and we have
a patch in review to fix it. However, the behaviour is different in
this case, so it is unlikely to be the same problem.

I've just published a new tag which fixes known test engine side problems on Debian 12 and Fedora 38.


One more idea is to install valgrind on the test engine host and
run with option --vg-rcf to check if something weird is happening.

What I don't understand right now is why I see just one failed attempt
to connect in your log.txt and then Logger shutdown after 9 minutes.

Andrew.

On 8/24/23 23:29, Adam Hassick wrote:
 > Is there any firewall in the network or on test hosts which could block incoming TCP connection to the port 23571 from the host where you run test engine?

Our test engine host and the testbed are on the same subnet. The connection does work sometimes.

 > If behaviour the same on the next try and you see that test agent is kept running, could you check using
 >
 > # netstat -tnlp
 >
 > that Test Agent is listening on the port and try to establish TCP connection from test agent using
 >
 > $ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571
 >
 > and check if TCP connection could be established.

I was able to replicate the same behavior again, where it hangs while RCF is trying to start.
Running this command, I see this in the output:

tcp        0      0 0.0.0.0:23571           0.0.0.0:*               LISTEN      18599/ta

So it seems like it is listening on the correct port.
Additionally, I was able to connect to the Tester machine from our Test Engine host using telnet. It printed the PID of the process once the connection was opened.
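
The telnet check above exercises only whichever address telnet happens to pick. A minimal sketch of the same check, trying every address the resolver returns for the host and reporting the address family of each attempt, could look like this. The function name `probe_tcp` and the 5 second default timeout are illustrative assumptions, not part of the test engine:

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try a TCP connection to every address the resolver returns for host.

    Returns a list of (family_name, address, error_or_None) tuples,
    one per resolved address.
    """
    results = []
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        name = "IPv6" if family == socket.AF_INET6 else "IPv4"
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results.append((name, sockaddr[0], None))
        except OSError as exc:
            # Timeouts and refused connections both end up here.
            results.append((name, sockaddr[0], str(exc)))
    return results
```

Calling `probe_tcp("iol-dts-tester.dpdklab.iol.unh.edu", 23571)` from the test engine host and printing the result would show whether one address family connects while the other times out.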

I tried running the "ta" application manually on the command line, and it didn't print anything at all.
Maybe the issue is something on the Test Engine side.

On Thu, Aug 24, 2023 at 2:35 PM Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> wrote:

    Hi Adam,

     > On the tester host (which appears to be the Peer agent), there
    are four processes that I see running, which look like the test
    agent processes.

    Before the next try, I'd recommend killing these processes.

    Is there any firewall in the network or on the test hosts which could
    block incoming TCP connections to port 23571 from the host
    where you run the test engine?

    If the behaviour is the same on the next try and you see that the test
    agent is kept running, could you check using

    # netstat -tnlp

    that the Test Agent is listening on the port, and try to establish a
    TCP connection from test agent using

    $ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571

    and check if the TCP connection could be established.

    Another idea is to log in to the Tester as root, as the test run does,
    take the TA start command from the log, and try it by hand without -n
    and with the extra escaping removed.

    # sudo PATH=${PATH}:/tmp/linux_x86_root_76872_1692885663_1
    LD_LIBRARY_PATH=${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/tmp/linux_x86_root_76872_1692885663_1 /tmp/linux_x86_root_76872_1692885663_1/ta Peer 23571 host=iol-dts-tester.dpdklab.iol.unh.edu:port=23571:user=root:key=/opt/tsf/keys/id_ed25519:ssh_port=22:copy_timeout=15:kill_timeout=15:sudo=:shell=

    Hopefully, in this case the test agent directory remains in /tmp and
    you don't need to copy it as the test run does.
    Maybe the output could shed some light on what's going on.
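
The last argument of the command above reads as a colon-separated list of key=value pairs. A small sketch for splitting it into a dictionary, which makes it easier to spot which timeouts and ports were actually passed, is shown here. It is only a reading of the string as it appears in the log, not the parser the test agent uses, and the helper name `parse_ta_conf` is hypothetical:

```python
def parse_ta_conf(confstr):
    """Split 'k1=v1:k2=v2:...' into a dict; empty values are kept as ''."""
    params = {}
    for item in confstr.split(":"):
        if not item:
            continue
        # partition() keeps everything after the first '=' as the value.
        key, _, value = item.partition("=")
        params[key] = value
    return params


conf = ("host=iol-dts-tester.dpdklab.iol.unh.edu:port=23571:user=root:"
        "key=/opt/tsf/keys/id_ed25519:ssh_port=22:copy_timeout=15:"
        "kill_timeout=15:sudo=:shell=")
params = parse_ta_conf(conf)
print(params["copy_timeout"], params["kill_timeout"])
```

A value that itself contains a colon would break this naive split, so treat it as a debugging aid only.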

    Andrew.

    On 8/24/23 17:30, Adam Hassick wrote:
    Hi Andrew,

    This is the output that I see in the terminal when this failure
    occurs, after the test agent binaries build and the test engine
    starts:

    Platform default build - pass
    Simple RCF consistency check succeeded
    --->>> Starting Logger...done
    --->>> Starting RCF...rcf_net_engine_connect(): Connection timed out iol-dts-tester.dpdklab.iol.unh.edu:23571

    Then, it hangs here until I kill the "te_rcf" and "te_tee"
    processes. I let it hang for around 9 minutes.

    On the tester host (which appears to be the Peer agent), there are
    four processes that I see running, which look like the test agent
    processes.

    ta.Peer is an empty file. I've attached the log.txt from this run.

     - Adam

    On Thu, Aug 24, 2023 at 4:22 AM Andrew Rybchenko
    <andrew.rybchenko@oktetlabs.ru> wrote:

        Hi Adam,

        Yes, TE_RCFUNIX_TIMEOUT is in seconds. I've double-checked
        that it goes to 'copy_timeout' in ts-conf/rcf.conf.
        The description in doc/sphinx/pages/group_te_engine_rcf.rst
        says that copy_timeout is in seconds, and the implementation in
        lib/rcfunix/rcfunix.c passes the value to select() tv_sec.
        Theoretically select() could be interrupted by signal, but I
        think it is unlikely here.
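
To illustrate what "a value in seconds passed to select()" means in practice, here is a minimal sketch that reads a timeout from the environment and waits on a socket for at most that long. This is not the rcfunix implementation; the helper names and the default of 30 seconds are assumptions for illustration only:

```python
import os
import select
import socket
import time


def timeout_from_env(name="TE_RCFUNIX_TIMEOUT", default=30):
    """Read a timeout in whole seconds from the environment."""
    return int(os.environ.get(name, default))


def wait_readable(sock, timeout_s):
    """Return True if sock becomes readable within timeout_s seconds."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return bool(readable)


if __name__ == "__main__":
    left, right = socket.socketpair()
    start = time.monotonic()
    # Nothing has been written yet, so this waits for the full timeout.
    print(wait_readable(left, 0.2), round(time.monotonic() - start, 1))
    right.send(b"x")
    # Data is pending now, so this returns immediately.
    print(wait_readable(left, 0.2))
    left.close()
    right.close()
```

In Python 3.5 and later, `select.select()` is retried automatically when a signal interrupts it. In C the caller has to handle `EINTR` itself, which is the interruption case mentioned above.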

        I'm not sure that I understand what you mean by RCF
        connection timeout. Does it happen on TE startup, when RCF
        starts the test agents? If so, TE_RCFUNIX_TIMEOUT could help. Or
        does it happen when tests are in progress, e.g. in the middle
        of a test? If so, TE_RCFUNIX_TIMEOUT is unrelated, and most
        likely either the host with the test agent dies or the test agent
        itself crashes. It would be easier for me to classify it if you
        share the text log (log.txt, full or just the corresponding fragment
        with some context). Also, the content of the ta.DPDK or ta.Peer file,
        depending on which agent has problems, could shed some light.
        These files contain the stdout/stderr of the test agents.

        Andrew.

        On 8/23/23 17:45, Adam Hassick wrote:
        Hi Andrew,

        I've set up a test rig repository here, and have created
        configurations for our development testbed based off of the
        examples.
        We've been able to get the test suite to run manually on
        Mellanox CX5 devices once.
        However, we are running into an issue where, when RCF starts,
        the RCF connection times out very frequently. We aren't sure
        why this is the case.
        It works sometimes, but most of the time when we try to run
        the test engine, it encounters this issue.
        I've tried changing the RCF port by setting
        "TE_RCF_PORT=<some port number>" and rebooting the testbed
        machines. Neither seems to fix the issue.

        It also seems like the timeout takes far longer than 60
        seconds, even when running "export TE_RCFUNIX_TIMEOUT=60"
        before I try to run the test suite.
        I assume the unit for this variable is seconds?

        Thanks,
        Adam

        On Mon, Aug 21, 2023 at 10:19 AM Adam Hassick
        <ahassick@iol.unh.edu> wrote:

            Hi Andrew,

            Thanks, I've cloned the example repository and will start
            setting up a configuration for our development testbed
            today. I'll let you know if I run into any difficulties
            or have any questions.

             - Adam

            On Sun, Aug 20, 2023 at 4:40 AM Andrew Rybchenko
            <andrew.rybchenko@oktetlabs.ru> wrote:

                Hi Adam,

                I've published
                https://github.com/ts-factory/ts-rigs-sample.
                Hopefully it will help to define your test rigs and
                successfully run some tests manually. Feel free to
                ask any questions and I'll answer here and try to
                update documentation.

                Meanwhile I'll prepare missing bits for steps (2) and
                (3).
                Hopefully everything is in place for step (4), but we
                need to make steps (2) and (3) first.

                Andrew.

                On 8/18/23 21:40, Andrew Rybchenko wrote:
                Hi Adam,

                > I've conferred with the rest of the team, and we
                think it would be best to move forward with mainly
                option B.

                OK, I'll provide the sample on Monday for you. It is
                almost ready right now, but I need to double-check
                it before publishing.

                Regards,
                Andrew.

                On 8/17/23 20:03, Adam Hassick wrote:
                Hi Andrew,

                I'm adding the CI mailing list to this
                conversation. Others in the community might find
                this conversation valuable.

                We do want to run testing on a regular basis. The
                Jenkins integration will be very useful for us, as
                most of our CI is orchestrated by Jenkins.
                I've conferred with the rest of the team, and we
                think it would be best to move forward with mainly
                option B.
                If you would like to know anything about our
                testbeds that would help you with creating an
                example ts-rigs repo, I'd be happy to answer any
                questions you have.

                We have multiple test rigs (we call these
                "DUT-tester pairs") that we run our existing
                hardware testing on, with differing network
                hardware and CPU architecture. I figured this might
                be an important detail.

                Thanks,
                Adam

                On Thu, Aug 17, 2023 at 11:44 AM Andrew Rybchenko
                <andrew.rybchenko@oktetlabs.ru> wrote:

                    Greetings Adam,

                    I'm happy to hear that you're trying to bring
                    it up.

                    As I understand, the final goal is to run it on
                    a regular basis. So, we need to do it properly
                    from the very beginning.
                    Bringing up all the features consists of 4 steps:

                    1. Create site-specific repository (we call it
                    ts-rigs) which contains information about test
                    rigs and other site-specific information like
                    where to send mails, where to store logs etc.
                    It is required for manual execution as well,
                    since test rigs description is essential. I'll
                    return to the topic below.

                    2. Set up log storage for automated runs.
                    Basically it is disk space plus an apache2 web
                    server with a few CGI scripts which help a lot to
                    save disk space.

                    3. Set up the Bublik web application, which provides
                    a web interface to view testing results. Same as
                    https://ts-factory.io/bublik

                    4. Set up Jenkins to run tests regularly,
                    save logs in the log storage (2) and import them to
                    Bublik (3).

                    We spent the last few months on our homework to make
                    it simpler to bring up automated execution
                    using Jenkins -
                    https://github.com/ts-factory/te-jenkins
                    Corresponding bits in dpdk-ethdev-ts will be
                    available tomorrow.

                    Let's return to the step (1).

                    Unfortunately there is no publicly available
                    example of the ts-rigs repository since
                    sensitive site-specific information is located
                    there. But I'm ready to help you to create it
                    for UNH. I see two options here:

                    (A) I'll ask questions and based on your
                    answers will create the first draft with my
                    comments.

                    (B) I'll make a template/example ts-rigs repo,
                    publish it and you'll create UNH ts-rigs based
                    on it.

                    Of course, I'll help to debug and finally bring
                    it up in any case.

                    (A) is a bit simpler for me and you, but (B) is
                    a bit more generic and will help other
                    potential users to bring it up.
                    We can combine (A)+(B). I.e. start from (A).
                    What do you think?

                    Thanks,
                    Andrew.

                    On 8/17/23 15:18, Konstantin Ushakov wrote:
                    Greetings Adam,


                    Thanks for contacting us. I'm copying Andrew, who
                    would be happy to help.

                    Thanks,
                    Konstantin

                    On 16 Aug 2023, at 21:50, Adam Hassick
                    <ahassick@iol.unh.edu> wrote:

                    
                    Greetings Konstantin,

                    I am in the process of setting up the DPDK
                    Poll Mode Driver test suite as an addition to
                    our testing coverage for DPDK at the UNH lab.

                    I have some questions about how to set the
                    test suite arguments.

                    I have been able to configure the Test Engine
                    to connect to the hosts in the testbed. The
                    RCF, Configurator, and Tester all begin to
                    run, however the prelude of the test suite
                    fails to run.

                    https://ts-factory.io/doc/dpdk-ethdev-ts/index.html#test-parameters

                    The documentation mentions that there are
                    several test parameters for the test suite,
                    like for the IUT test link MAC, etc. These
                    seem like they would need to be set somewhere
                    to run many of the tests.

                    I see in the Test Engine documentation, there
                    are instructions on how to create new
                    parameters for test suites in the Tester
                    configuration, but there is nothing in the
                    user guide or in the Tester guide for how to
                    set the arguments for the parameters when
                    running the test suite that I can find. I'm
                    not sure if I need to write my own Tester
                    config, or if I should be setting these in
                    some other way.

                    How should these values be set?

                    I'm also not sure what environment
                    variables/arguments are strictly necessary or
                    which are optional.

                    Regards,
                    Adam

                    --                     *Adam Hassick*
                    Senior Developer
                    UNH InterOperability Lab
                    ahassick@iol.unh.edu
                    <mailto:ahassick@iol.unh.edu>
                    iol.unh.edu <https://www.iol.unh.edu/>
                    +1 (603) 475-8248




--------------zJ3OEYrcU8dBLinkGB8bw4cJ--