Hello Adam,
On 8/24/23 23:54, Andrew Rybchenko wrote:
I'd like to try to repeat the problem locally. Which Linux distro is
running on the test engine and agents?
In fact, I know of one problem with Debian 12 and Fedora 38, and we
have a patch in review to fix it. However, the behaviour is different
in this case, so it is unlikely to be the same problem.
I've just published a new tag which fixes known test engine side
problems on Debian 12 and Fedora 38.
One more idea is to install valgrind on the test engine host and
run with option --vg-rcf to check if something weird is happening.
What I don't understand right now is why I see just one failed attempt
to connect in your log.txt and then Logger shutdown after 9 minutes.
Andrew.
On 8/24/23 23:29, Adam Hassick wrote:
> Is there any firewall in the network or on test hosts which could
> block incoming TCP connection to the port 23571 from the host where
> you run test engine?
Our test engine host and the testbed are on the same subnet. The
connection does work sometimes.
> If the behaviour is the same on the next try and you see that the
> test agent is kept running, could you check using
>
> # netstat -tnlp
>
> that Test Agent is listening on the port and try to establish a TCP
> connection from test agent using
>
> $ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571
>
> and check if TCP connection could be established.
I was able to replicate the same behavior again, where it hangs
while RCF is trying to start.
Running this command, I see this in the output:
tcp 0 0 0.0.0.0:23571 0.0.0.0:* LISTEN 18599/ta
So it seems like it is listening on the correct port.
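The same check can be scripted when it has to be repeated across runs. The following is a minimal sketch, assuming the usual `netstat -tnlp` column order (protocol, queues, local address, foreign address, state, PID/program); the function name `find_listener` is hypothetical and not part of the Test Environment.

```python
from typing import Optional


def find_listener(netstat_output: str, port: int) -> Optional[str]:
    """Return the PID/program column of the TCP socket listening on `port`, or None."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns:
        # Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "LISTEN":
            continue
        # The local address ends with ":<port>" for IPv4 and IPv6 entries alike.
        if fields[3].rsplit(":", 1)[-1] == str(port):
            return " ".join(fields[6:])
    return None


# The line reported above, used as sample input.
sample = "tcp 0 0 0.0.0.0:23571 0.0.0.0:* LISTEN 18599/ta"
print(find_listener(sample, 23571))
```

Feeding it the captured output of `netstat -tnlp` from the tester host shows whether any process, and which one, holds the RCF port.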
Additionally, I was able to connect to the Tester machine from
our Test Engine host using telnet. It printed the PID of the
process once the connection was opened.
I tried running the "ta" application manually on the command
line, and it didn't print anything at all.
Maybe the issue is something on the Test Engine side.
On Thu, Aug 24, 2023 at 2:35 PM Andrew Rybchenko
<andrew.rybchenko@oktetlabs.ru> wrote:
Hi Adam,
> On the tester host (which appears to be the Peer agent), there are
> four processes that I see running, which look like the test agent
> processes.
Before the next try, I'd recommend killing these processes.
Is there any firewall in the network or on test hosts which could
block incoming TCP connection to the port 23571 from the host where
you run test engine?
If the behaviour is the same on the next try and you see that the
test agent is kept running, could you check using
# netstat -tnlp
that Test Agent is listening on the port and try to establish a TCP
connection from test agent using
$ telnet iol-dts-tester.dpdklab.iol.unh.edu 23571
and check if TCP connection could be established.
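Where telnet is not installed, the reachability check can be approximated with a few lines of Python. This is only a sketch of a plain TCP connect with a timeout; the helper name `can_connect` is hypothetical, and it tests nothing beyond whether the connection opens from the host where it is run.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call, using the host and port from this thread:
#   can_connect("iol-dts-tester.dpdklab.iol.unh.edu", 23571)
```

A False result does not say why the connection failed; a firewall drop usually shows up as a timeout, while a closed port is refused immediately.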
Another idea is to log in to the Tester as root, as testing does, get
the TA start command from the log, and try it by hand without -n and
with the extra escaping removed.
# sudo PATH=${PATH}:/tmp/linux_x86_root_76872_1692885663_1
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/tmp/linux_x86_root_76872_1692885663_1
/tmp/linux_x86_root_76872_1692885663_1/ta Peer 23571
host=iol-dts-tester.dpdklab.iol.unh.edu:port=23571:user=root:key=/opt/tsf/keys/id_ed25519:ssh_port=22:copy_timeout=15:kill_timeout=15:sudo=:shell=
Hopefully in this case the test agent directory remains in /tmp and
you don't need to copy it as testing does.
Maybe the output could shed some light on what's going on.
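The last argument of the command above looks like a colon-separated list of key=value settings. The sketch below splits it into a dictionary so individual values, such as copy_timeout, can be inspected. It assumes that values never contain ':' (true for the string shown here) and is not the parser used in lib/rcfunix/rcfunix.c; the name `parse_ta_conf` is hypothetical.

```python
def parse_ta_conf(confstr: str) -> dict:
    """Split 'key=value:key=value' into a dict; empty values are kept as ''."""
    params = {}
    for item in confstr.split(":"):
        if not item:
            continue
        # partition() keeps everything after the first '=' as the value.
        key, _, value = item.partition("=")
        params[key] = value
    return params


# The configuration string from the command above.
confstr = (
    "host=iol-dts-tester.dpdklab.iol.unh.edu:port=23571:user=root:"
    "key=/opt/tsf/keys/id_ed25519:ssh_port=22:copy_timeout=15:"
    "kill_timeout=15:sudo=:shell="
)
params = parse_ta_conf(confstr)
print(params["copy_timeout"], params["kill_timeout"])
```

Comparing these values with what was exported in the environment is a quick way to see which timeout actually reached the agent start command.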
Andrew.
On 8/24/23 17:30, Adam Hassick wrote:
Hi Andrew,
This is the output that I see in the terminal when this failure
occurs, after the test agent binaries build and the test engine
starts:
Platform default build - pass
Simple RCF consistency check succeeded
--->>> Starting Logger...done
--->>> Starting RCF...rcf_net_engine_connect(): Connection timed out iol-dts-tester.dpdklab.iol.unh.edu:23571
Then, it hangs here until I kill the "te_rcf" and "te_tee" processes.
I let it hang for around 9 minutes.
On the tester host (which appears to be the Peer agent), there are
four processes that I see running, which look like the test agent
processes.
ta.Peer is an empty file. I've attached the log.txt from this run.
- Adam
On Thu, Aug 24, 2023 at 4:22 AM Andrew Rybchenko
<andrew.rybchenko@oktetlabs.ru> wrote:
Hi Adam,
Yes, TE_RCFUNIX_TIMEOUT is in seconds. I've double-checked that it
goes to 'copy_timeout' in ts-conf/rcf.conf.
The description in doc/sphinx/pages/group_te_engine_rcf.rst says that
copy_timeout is in seconds, and the implementation in
lib/rcfunix/rcfunix.c passes the value to select() tv_sec.
Theoretically select() could be interrupted by a signal, but I think
it is unlikely here.
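To illustrate what a select() timeout given in seconds means in practice, here is a small Python sketch. It is not the code from lib/rcfunix/rcfunix.c; the helper `wait_readable` and the 0.2 second timeout are made up for the example. It waits on a socket that never receives data, so the call returns only when the timeout expires.

```python
import select
import socket
import time


def wait_readable(sock: socket.socket, timeout_sec: float) -> bool:
    """Return True if `sock` becomes readable within `timeout_sec` seconds."""
    readable, _, _ = select.select([sock], [], [], timeout_sec)
    return bool(readable)


a, b = socket.socketpair()

start = time.monotonic()
ready = wait_readable(a, 0.2)  # nothing was sent, so select() waits out the timeout
elapsed = time.monotonic() - start
print(ready, elapsed)

b.sendall(b"x")
ready_after = wait_readable(a, 0.2)  # data is pending, so select() returns at once
print(ready_after)

a.close()
b.close()
```

One difference from C is worth keeping in mind: since Python 3.5 an interrupted select() is retried automatically with the remaining timeout, whereas the C call returns early with EINTR and leaves the retry to the caller.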
I'm not sure that I understand what you mean by RCF connection
timeout. Does it happen on TE startup, when RCF starts test agents? If
so, TE_RCFUNIX_TIMEOUT could help. Or does it happen when tests are in
progress, e.g. in the middle of a test? If so, TE_RCFUNIX_TIMEOUT is
unrelated and most likely either the host with the test agent dies or
the test agent itself crashes. It would be easier for me to classify
it if you share the text log (log.txt, full or just the corresponding
fragment with some context). The content of the ta.DPDK or ta.Peer
file, depending on which agent has problems, could also shed some
light. These files contain the stdout/stderr of the test agents.
Andrew.
On 8/23/23 17:45, Adam Hassick wrote:
Hi Andrew,
I've set up a test rig repository here, and have created
configurations for our development testbed based off of the examples.
We've been able to get the test suite to run manually on Mellanox CX5
devices once.
However, we are running into an issue where, when RCF starts, the RCF
connection times out very frequently. We aren't sure why this is the
case.
It works sometimes, but most of the time when we try to run the test
engine, it encounters this issue.
I've tried changing the RCF port by setting
"TE_RCF_PORT=<some port number>" and rebooting the testbed machines.
Neither seems to fix the issue.
It also seems like the timeout takes far longer than 60 seconds, even
when running "export TE_RCFUNIX_TIMEOUT=60" before I try to run the
test suite.
I assume the unit for this variable is seconds?
Thanks,
Adam
On Mon, Aug 21, 2023 at 10:19 AM Adam Hassick
<ahassick@iol.unh.edu> wrote:
Hi Andrew,
Thanks, I've cloned the example repository and will start setting up a
configuration for our development testbed today. I'll let you know if
I run into any difficulties or have any questions.
- Adam
On Sun, Aug 20, 2023 at 4:40 AM Andrew Rybchenko
<andrew.rybchenko@oktetlabs.ru> wrote:
Hi Adam,
I've published https://github.com/ts-factory/ts-rigs-sample.
Hopefully it will help to define your test rigs and successfully run
some tests manually. Feel free to ask any questions and I'll answer
here and try to update documentation.
Meanwhile I'll prepare missing bits for steps (2) and (3).
Hopefully everything is in place for step (4), but we need to make
steps (2) and (3) first.
Andrew.
On 8/18/23 21:40, Andrew Rybchenko wrote:
Hi Adam,
> I've conferred with the rest of the team, and we think it would be
> best to move forward with mainly option B.
OK, I'll provide the sample on Monday for you. It is almost ready
right now, but I need to double-check it before publishing.
Regards,
Andrew.
On 8/17/23 20:03, Adam Hassick wrote:
Hi Andrew,
I'm adding the CI mailing list to this conversation. Others in the
community might find this conversation valuable.
We do want to run testing on a regular basis. The Jenkins integration
will be very useful for us, as most of our CI is orchestrated by
Jenkins.
I've conferred with the rest of the team, and we think it would be
best to move forward with mainly option B.
If you would like to know anything about our testbeds that would help
you with creating an example ts-rigs repo, I'd be happy to answer any
questions you have.
We have multiple test rigs (we call these "DUT-tester pairs") that we
run our existing hardware testing on, with differing network hardware
and CPU architecture. I figured this might be an important detail.
Thanks,
Adam
On Thu, Aug 17, 2023 at 11:44 AM Andrew Rybchenko
<andrew.rybchenko@oktetlabs.ru> wrote:
Greetings Adam,
I'm happy to hear that you're trying to bring it up.
As I understand it, the final goal is to run it on a regular basis.
So, we need to do it properly from the very beginning.
Bringing up all features consists of 4 steps:
1. Create a site-specific repository (we call it ts-rigs) which
contains information about test rigs and other site-specific
information like where to send mails, where to store logs etc. It is
required for manual execution as well, since test rigs description is
essential. I'll return to the topic below.
2. Set up logs storage for automated runs. Basically it is disk space
plus an apache2 web server with a few CGI scripts which help a lot to
save disk space.
3. Set up the Bublik web application, which provides a web interface
to view testing results. Same as https://ts-factory.io/bublik
4. Set up Jenkins to run tests regularly, save logs in log storage (2)
and import them to Bublik (3).
We spent the last few months on our homework to make it simpler to
bring up automated execution using Jenkins -
https://github.com/ts-factory/te-jenkins
Corresponding bits in dpdk-ethdev-ts will be available tomorrow.
Let's return to step (1).
Unfortunately there is no publicly available example of the ts-rigs
repository, since sensitive site-specific information is located
there. But I'm ready to help you to create it for UNH. I see two
options here:
(A) I'll ask questions and, based on your answers, will create the
first draft with my comments.
(B) I'll make a template/example ts-rigs repo, publish it and you'll
create UNH ts-rigs based on it.
Of course, I'll help to debug and finally bring it up in any case.
(A) is a bit simpler for me and you, but (B) is a bit more generic and
will help other potential users to bring it up.
We can combine (A)+(B). I.e. start from (A). What do you think?
Thanks,
Andrew.
On 8/17/23 15:18, Konstantin Ushakov wrote:
Greetings Adam,
Thanks for contacting us. I copy Andrew, who would be happy to help.
Thanks,
Konstantin
On 16 Aug 2023, at 21:50, Adam Hassick <ahassick@iol.unh.edu> wrote:
Greetings Konstantin,
I am in the process of setting up the DPDK Poll Mode Driver test suite
as an addition to our testing coverage for DPDK at the UNH lab.
I have some questions about how to set the test suite arguments.
I have been able to configure the Test Engine to connect to the hosts
in the testbed. The RCF, Configurator, and Tester all begin to run,
however the prelude of the test suite fails to run.
https://ts-factory.io/doc/dpdk-ethdev-ts/index.html#test-parameters
The documentation mentions that there are several test parameters for
the test suite, like for the IUT test link MAC, etc. These seem like
they would need to be set somewhere to run many of the tests.
I see in the Test Engine documentation, there are instructions on how
to create new parameters for test suites in the Tester configuration,
but there is nothing in the user guide or in the Tester guide for how
to set the arguments for the parameters when running the test suite
that I can find. I'm not sure if I need to write my own Tester config,
or if I should be setting these in some other way.
How should these values be set?
I'm also not sure what environment variables/arguments are strictly
necessary or which are optional.
Regards,
Adam
--
*Adam Hassick*
Senior Developer
UNH InterOperability Lab
ahassick@iol.unh.edu
iol.unh.edu <https://www.iol.unh.edu/>
+1 (603) 475-8248