Workplan (status from 08/06/05)

Introduction

The workplan is structured by the different types of testing activity. It also follows the phases given by the available farm size at a given time. The testing activities are the following:

  1. Tdaq system verification: At the beginning of each phase, one or a few partitions of the integrated system should be run in order to verify the basic functionality of the system. This verification should follow the steps 1-4 as described for the integrated system tests. These are basic functionality tests; no performance tests are executed.
  2. Integration tests: The system is assembled according to the steps 1-4 described below. A selection of different parameters or partition structures leads to variants of the basic partitions to be tested. Sufficient performance tests are executed to obtain statistically reliable values.
  3. Individual tests: Tests are executed which emphasize the usage of a particular component. Some of those tests are run in a partition, others run independently.
  4. Sub-system tests: Special emphasis is given to understanding certain aspects of parts of the system. Performance tests are executed in some cases.
  5. HLT-S tests: The HLT tests fall into two groups: the tests to install the HLT-S software combined with the offline software, and the tests to run a partition on the installed software. The time to install the software on a large number of PCs will be measured.

Schedule Overview:

The table for the global schedule indicates the main testing activities per testing phase. Point 1 as listed above should always be executed at the beginning of a new phase; these verification tests are not specifically listed in the table. Emphasis on integration testing is placed in the second part of the testing period. In the early phases, component testing and LVL2 tests together with the HLT-S installation tests will have priority. The availability of the testers has been taken into account as far as it is known. Phase 1 is a pre-phase during which the system(s) will be installed; they can be used as soon as this step is finished successfully. During or after phase 4, the priorities for the last 2 phases will be set. Tests which have run successfully up to then and could give additional information about the behaviour of the system on an even larger scale will get priority. Integration tests should be run if they have been successful up to that point.

The general schema for each phase is described here, but we need to be flexible, and details will depend on the actual situation:

 

 

Involved testers:

 

Detailed workplan :

 System integration tests and sub-system tests
        Level2 tests (pointer to a separate Word document; not listed inline)

Component tests

Controls
Monitoring
Configuration database tests with Oracle

CDI tests

HLT-S

Preparation tasks and tools upgrade for the tests

Sub-system and system integration tests

The HLT/DAQ integration tests should follow a series of steps before being combined into the complete system, in order to verify the basic system functionality for each phase. The specific sub-system and integration steps, the component tests and the HLT-S tests can then follow. Details of these steps are set out below:


Step 1: Online Software System: Doris + Mihai in phases 1, 2, 3; + someone for phases 4, 5, 6

Run a 2-level partition for phases 1 and 2, and for the following phases 3-level partitions with 50 local controllers per segment, one IS server per sub-farm controller and one RDB server per sub-farm controller.

Steps for the online:
1) online:
topology of control tree with rc_empty_controller:
1 Root controller
    n global controllers (n to be specified)
    m sub-farm controllers per global controller (m to be specified)
    p controllers per sub-farm (p to be specified)
option of one IS server per sub-farm
option of one rdb-server per sub-farm
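
As a rough sizing aid for this topology, the minimal Python sketch below counts the applications in such a 3-level control tree for given n, m and p. The example values of n and m are illustrative only ("to be specified" in the plan); the 50 local controllers per segment come from the description above.

  # Sketch: count the applications in the 3-level control tree described above.
  # n, m, p are the placeholders "to be specified"; the per-sub-farm IS and RDB
  # servers are the two options listed.

  def control_tree_size(n_global, m_subfarm, p_local,
                        is_per_subfarm=True, rdb_per_subfarm=True):
      subfarms = n_global * m_subfarm
      counts = {
          "root_controller": 1,
          "global_controllers": n_global,
          "subfarm_controllers": subfarms,
          "local_controllers": subfarms * p_local,
          "is_servers": subfarms if is_per_subfarm else 0,
          "rdb_servers": subfarms if rdb_per_subfarm else 0,
      }
      counts["total"] = sum(counts.values())
      return counts

  # Illustrative numbers only: 5 global controllers, 10 sub-farms each,
  # 50 local controllers per sub-farm (the 50 comes from the plan above).
  print(control_tree_size(5, 10, 50))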

2) online + ExampleApplication to include an additional application per local controller:
topology of control tree with rc_empty_controller for root and intermediate controllers and the local controller for the leaves:
1 Root controller
    n global controllers (n to be specified)
    m sub-farm controllers per global controller (m to be specified)
    p controllers per sub-farm (p to be specified)
1 ExampleApplication per leaf controller node
option of one IS server per sub-farm
option of one rdb-server per sub-farm

3) same as 2) but ExampleApplication replaced by test_app

4) online + is_test_source to introduce IS traffic:
topology of control tree with rc_empty_controller for root and intermediate controllers and the local controller for the leaves:
1 Root controller
    n global controllers (n to be specified)
    m sub-farm controllers per global controller (m to be specified)
    p controllers per sub-farm (p to be specified)
1 is_test_source per leaf controller node, communicating with its respective sub-farm IS server
one IS server per sub-farm
one rdb-server per sub-farm and one is_test_receiver connecting to the corresponding IS server.
Note:
is_test_source should have the following parameters set: -p <partition-name> -n <server-name> -N 5 -S <name_suffix> -s M
and the default values for the other parameters.
<name_suffix> is a string which has to be unique per is_test_source in the db.
It should be set to start at start of run and to stop at stop of run

is_test_receiver -p <partition-name> -n <server-name> -u 10000
It should be set to start at start of run and to stop at stop of run
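
To illustrate how this configuration could be scripted, the sketch below prints the is_test_source / is_test_receiver command lines described in the note, one source per leaf controller node and one receiver per sub-farm IS server. The partition and IS server names are hypothetical placeholders; only the fixed options (-N 5, -s M, -u 10000) and the requirement of a unique <name_suffix> come from the note above.

  # Sketch (hypothetical names): generate the IS-traffic command lines per sub-farm.

  def is_traffic_commands(partition, n_subfarms, sources_per_subfarm):
      cmds = []
      for sf in range(n_subfarms):
          server = f"IS_SF{sf:03d}"                 # assumed per-sub-farm IS server name
          for src in range(sources_per_subfarm):
              suffix = f"sf{sf:03d}_src{src:03d}"   # unique per is_test_source in the db
              cmds.append(f"is_test_source -p {partition} -n {server} "
                          f"-N 5 -S {suffix} -s M")
          cmds.append(f"is_test_receiver -p {partition} -n {server} -u 10000")
      return cmds

  for c in is_traffic_commands("LST_example", n_subfarms=2, sources_per_subfarm=3):
      print(c)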
 

Step 2: ROS + DC: Sonja (Gokhan) for ROS, Per for LVL2, Kostas for DC

a. ONLSW + LVL2 ROI Collection (L2SVs + L2PUs + ROSs)

description of partitions to run: (to be prepared)

estimated time for verification:

estimated time for sub-system tests:

• Test the scalability aspect of the High Level Trigger LVL2 part of the dataflow without algorithms. Components involved: as in step 1 + ROI Collection + O(200 ROSs, 800 L2PUs, 10 L2SVs) (see the node tally after this list)
• Study of control aspects, robustness and stability on a large scale
• Take measurements for state transitions for varying numbers of sub-farms and numbers of nodes per sub-farm as a test for farm control
• Inter-operability of components, use of IS and MRS while ROSes use dummy data
• Use of UDP and of TCP, possibly also mixed (UDP for ROS-L2PU only?); DC networks should be configured and tested
• Study 2-tier and 3-tier hierarchies; vary the number of sub-farms and the number of applications per sub-farm
• Verify functionality for up to 8 L2PU applications per node (few algorithms run multi-threaded; multi-processor nodes)
• Don't aim to measure performance or scalability of the DF system
• Keep data rates on the DC network VERY low (high burn time) so as not to spoil the measurements
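
A rough tally of the node budget implied by the figures above, assuming one ROS application and one L2SV per node and the maximum of 8 L2PU applications per node (a sketch only, not a requirement of the plan):

  # Sketch: back-of-the-envelope node count for the LVL2 slice of step 2a.

  def lvl2_node_budget(n_ros=200, n_l2pu=800, n_l2sv=10, l2pu_per_node=8):
      l2pu_nodes = -(-n_l2pu // l2pu_per_node)      # ceiling division
      return {
          "ROS nodes": n_ros,                       # assuming one ROS application per node
          "L2PU nodes": l2pu_nodes,
          "L2SV nodes": n_l2sv,                     # assuming one L2SV per node
          "total nodes": n_ros + l2pu_nodes + n_l2sv,
      }

  print(lvl2_node_budget())                         # about 310 nodes for the full slice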

ROS: functionality: shared mode 1/2 day; no performance tests

LVL2: functionality:

description of partitions to run: (to be prepared)

b. ONLSW + LVL2 ROI Collection + Event Builder

description of partitions to run: (to be prepared)

estimated time for verification:

estimated time for sub-system tests:

• Components involved: as in step 2 a) + Event Builder (DFM, SFIs, ROSs)

 -------------------------------------------------------------------
    LST2005: Work Plan for the EventBuilding DataCollection part
 -------------------------------------------------------------------

 Aim:
 ----
   Test scalability aspect of the EventBuilding part of the dataflow.

 General:
 --------
 *  Study of control aspects, robustness and stability on a full scale of 
    the ATLAS EventBuilder
 *  Take measurements for state transitions for varying sizes of the 
    EventBuilder as a test for RunControl
 *  Use of UDP and use of TCP
 *  Don't aim to measure performance or scalability of the EventBuilder; 
    i.e. very low data rates, and only a few dummy events built

 Details:
 --------
 1) In the first phases of the Large Scale Tests, we want to run with a 
    partition having an EB segment with: one DFM and a few SFI's plus a 
    ROS segment with a few ROS's (ROS-to-SFI ratio around 3:2; 
    e.g., 15 ROS's and 10 SFI's). This is to prepare, test and 
    understand the test procedures defined in step 2) and 3) below.
   
 2) When we have around 300 machines in the LST05, we'd like to use a
    full scale version of the EB and ROS segments above: 
    still have one DFM, but now have ~150 ROS's and ~100 SFI's, as 
    foreseen for the final ATLAS DAQ system.
    A few intermediate sizes will also be measured, with 1/4, 1/2 and 
    3/4 of the full-scale system (see the sizing sketch after this list).

 3) Integrate with ONLSW + LVL2 ROI Collection + Event Builder tests 
    [i.e., step 2b) of LST workplan]:
    Go back to a smaller EB segment, always with one DFM and
    ROS-to-SFI machines in the ratio of 3:2 (e.g., 12 ROS's, 8 SFI's) and
    allocate the available PCs to the LVL2 subfarms for testing.

 4) Integration: ONLSW + L2 ROI Collection + Event Builder + EF:
    [i.e., step 4) of LST workplan]:
    As step 3) above, but EF subfarms added. SFIs need to connect to EFDs.
    Need to add a few (max 5) SFOs in a SFO segment.

 5) If at the end of the tests, as we approach the 19th of July, we have 
    done all the tests we want with the Online, LVL2 and EF systems, then
    we could check an EB and ROS segment with twice the size of the full 
    ATLAS configuration: 1 DFM, 300 ROS's and 200 SFI's just to see if we 
    hit a wall there. 
    This should not have the highest priority in the overall planning; 
    nevertheless, if there is some time available this would still be a 
    useful test.
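
As a sizing aid for the measurement points above, the sketch below lists the EB segment composition at fractions of the full-scale system, keeping one DFM and the 3:2 ROS-to-SFI ratio. The 150/100 full-scale figures and the optional "twice full scale" point come from items 2) and 5); the rounding is mine.

  # Sketch: EB segment sizes for the measurement points described above.

  FULL_ROS, FULL_SFI = 150, 100        # foreseen final ATLAS DAQ EB size (item 2)

  def eb_point(fraction):
      ros = round(FULL_ROS * fraction)
      sfi = round(FULL_SFI * fraction)
      return {"fraction": fraction, "DFM": 1, "ROS": ros, "SFI": sfi,
              "machines": 1 + ros + sfi}

  for f in (0.25, 0.5, 0.75, 1.0, 2.0):   # 2.0 is the optional check of item 5
      print(eb_point(f))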

Step 3: ONLSW + EF (Serge and Andrea from phase 3 on; in phases 1, 2 Doris)

description of partitions to run: (to be prepared)

estimated time for verification:

estimated time for sub-system tests:

• Verify robustness of the EF control and operational monitoring
• Verify that the performance is in the expected range
• Study load balancing on PTs
• Study EF communication for SFI, EFD, SFO
• Take measurements for state transitions for varying number of sub-farms and
number of nodes per sub-farm as a test for farm control

1) Tests of TDAQ framework from the side of Event  Filter components (EFD/PT):
WestGrid tests showed that the communication between EFD/PT or with other components of TDAQ may be an additional source of problems. Thus we would like to test (with only PTDummy, NO Athena):
   - stability of FSM transitions
   - EFD-PT, EFD/PT--IS/LC/IPC communications
   - PT/EFD timeouts and their relation/influence
     on TDAQ timeouts
   - robustness of TDAQ system in case of problems
     on EFD/PT side (killing PTs/EFDs, "Ignore/Restart"
     flags for EFDs/PTs)
   - influence of network/local_disk installations
     on performance (RDB/OKS/IPC_INIT_REF as file/jacorb)
   - check/unify error/logging output from EFD/PTs

  Q2: some scalability problems appeared during the WestGrid large
      scale tests.
  Q3: Andrea, Serge, Sander, Haimo
  Q4: yes, Q5: yes
  Q6: can run shared. exclusive access needed only in case of problems
  Q7: as many nodes as possible
  Q8: no

2) A separate issue is the comparison of the performance of EF farms of different topologies (also only
   PTDummy, NO Athena):
   - find out the optimal size of a SubFarm (e.g. 10x60 or
     20x30 or 30x20, etc.)
   - test scenarios with an IS per SubFarm / LocalController
     per node.
   - also, with PTDummy we can test the influence of event_size and CPU_BurningTime on performance

 Q2: by definition
  Q3: Andrea, Serge, Sander, Haimo
  Q4: yes, Q5: yes
  Q6: can run shared. exclusive access needed only in case of problems
  Q7: as many nodes as possible
  Q8: no


Step 4: Integration: ONLSW + L2 ROI Collection + Event Builder + EF (steps 2 + 3 together)

Doris + Sonja

description of partitions to run: (to be prepared)

estimated time for verification:

estimated time for sub-system tests:

• Test integrated functionality
• Take measurements for state transitions for an increasing number of elements
• Verify robustness with the help of operational monitoring while in running state
Note that the so-called standalone tests of both the LVL2 and EF software are in fact already binary
tests of the LVL2 and EF software respectively with the Online Software. Although some of the control
software has been customized specifically for use in the HLT (e.g. the Local Controller implementation
of the Run Control) the HLT supervision system as a whole relies heavily on the Online Software
infrastructure for process control, inter-process communication, monitoring, etc.
Component tests for the Online Software Information Service were planned to be performed at the
beginning of the tests in order to use the resulting information in the optimization of configuration
parameters for the subsequent integration tests. Tests for the Configuration Databases, Setup, and for
Web services were planned to be performed at convenient times, independently of the integration steps
described.

Controls

Giovanna coordinating the tests

estimated time for the tests:

for some of our new components:

1) Access Manager (Marius):
Given a large partition (which might be purely dummy or contain real DF 
applications), evaluate the performance loss at setup 
and transitions introduced by switching on the access management. Some 
time should be allowed to try to optimize AM (cache lifetime) and Corba 
parameters. To use the AM we need 1 machine configured as a MySQL 
server containing the AM database. These tests should be performed 
towards the end of the testing period, if and when the rest of the system 
is fairly stable.
At every change of size of the system, a partition should be run and 
timed 
with the AM deactivated and then with the AM active. Time differences for 
booting and transitions shall be recorded. If feasible, the AM should be 
then always kept active.
- To use the AM we need a mysql server and a users database. Marius will 
follow up with Gokhan that we have a machine installed for this purpose on 
lxshare.
- John-Erik and Marius will provide instructions on how to 
activate/deactivate authorization via the AM. (1 Variable in the 
database, at Partition level).
- Marius will execute the timing tests for this component.
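
A minimal sketch of the kind of timing comparison described above, assuming two hypothetical commands that boot (or transition) the partition with the AM setting already applied; the real switch is the partition-level database variable mentioned above.

  # Sketch (hypothetical commands): time the same action with AM off and on.

  import subprocess
  import time

  def timed(cmd):
      t0 = time.monotonic()
      subprocess.run(cmd, shell=True, check=True)
      return time.monotonic() - t0

  def compare_am(cmd_am_off, cmd_am_on, label):
      t_off = timed(cmd_am_off)
      t_on = timed(cmd_am_on)
      print(f"{label}: AM off {t_off:.1f}s, AM on {t_on:.1f}s, "
            f"overhead {t_on - t_off:+.1f}s")

  # Hypothetical usage; the scripts stand in for booting the partition with
  # the corresponding database setting already applied:
  # compare_am("./boot_partition_am_off.sh", "./boot_partition_am_on.sh", "boot")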

2) DVS (Alina):
Scalability tests. Some time should be dedicated to seeing the performance of 
DVS when a very large number of tests is launched. In general, DVS shall 
be used throughout the testing period, for testing, log browsing, 
etc.

3) Setup(Andrei):
Setup shall be used for all tests. This LST period shall definitely 
assess its functioning. Dedicated time should be scheduled to see how well 
we manage to recover when infrastructure applications are killed. 
- No specific test, but assistance from Andrei if there are problems.
- Andrei will check that the text_gui works from within setup.

4) New PMG (Marc): on dedicated release
Dedicated time to start/stop/monitor large number of processes and measure 
performance. These tests will be made with dedicated test programs, not 
with the DAQ. When these tests are ongoing the system shall be locked, in 
order not to mix up old and new pmg agents. We would like to get 2 days 
for this (one between 21-26 of June and one towards the end of the LST 
period). The second day might be dropped if everything works fine during 
the first tests. Tests will first be done on the building 32 testbed. Then 
1/2 day of exclusive mode is needed.
- tests will be first carried out on bdg 32.
- If successful, 1 day of parasitic running on (part of) the cluster 
will be requested (beginning of July) to install the software needed and do 
preliminary tests, followed by half a day of measurements (exclusive usage).
- If the first tests have been successful, another timeslot towards 
the end will be requested to run some automated measurements (can be done 
over night).

5) New RC (Giovanna): on dedicated release
Dedicated tests on large (dummy) partitions to verify the functioning of 
the new run control and supervision. If everything works fine the new RC 
could be used for real partitions as well. These tests will require the 
installation of a second release containing the apps linked against the 
new RC. Tests could start after June 26th. Tests will first be done on the 
building 32 testbed. Then 1/2 day of exclusive mode is needed. If tests are 
successful, the new controller could be used from then on (boot and shutdown 
are expected to be faster because of the distributed dsa_supervisor).
- Tests will first be carried out in bdg 32.
- If successful, during July we will run a few equivalent partitions with 
old and new Controller and check if the behavior is similar, performance 
wise. One day will be enough for this. During the night we might then be 
able to make comparative automated timing tests.
- Another slot of automated tests will be requested towards the end of the 
period, if the first tests were successful.
- Current RC:
- at every change of size, run a partition with Controllers and 
ExampleApplications only to test the run control chain completely.
- if there is spare time, applications and controllers shall be 
killed on purpose, randomly, to check that they are restarted/ignored as 
specified in the DB and that the system behaves as expected.
 
6) IGUI (Mihai):
Tests on large partitions of particular aspects of the scalability 
of the IGUI panels. Last year the IGUI was used for 
configurations including up to 150 nodes. We will 
check the behaviour for larger configurations (for example 400 or 500 nodes), 
especially for the PMG, Run Control, DataFlow and MRS panels.
- During setup time (at each relevant size change) the IGUI shall be tried 
out to check its scalability. 
- setup_daq shall always be started without GUI, then we will try out 2 
different ways of running the IGUI: either started on a local machine 
outside the lxshare cluster (fast graphics, but potentially large 
latencies over the CERN network to send commands, etc...) or started on a 
dedicated machine within the lxshare cluster with a remote display (fast 
command distribution, but slow graphics response).
- the general shifter in charge of setting up the system shall use the 
IGUI, assisted by Mihai, if/when needed.
Log Service:
- it is requested that a partition stays up for some time (30 minutes).
- the log server will be started parasitically, but it might have an 
impact on the load in the mrs_server (keep an eye on CPU load!).
- Log consumers will be requesting data and we will evaluate at which 
point log messages will start to be dropped.
- this exercise should be repeated at each size change, if no evident 
problems have appeared.
- Raul will contact Gokhan to tell him what is needed from an installation 
point of view (mysql and web servers).
- These tests can be done with any partition. Raul will measure the 
behaviour of the LS.
 
------------
The developers will be running the specific tests. That's why 
the first week is not suited for the PMG tests. As for the AM, I will 
run the tests myself, since John-Erik will be leaving for good at the 
beginning of June. He has already started tests on a medium scale (bdg 32). 
In general, people in Controls will also participate in the tests when 
running in integrated mode. We'll have to sort out the coverage of the 
full period.
Absences:
Alina: 09 June - 26 June
Raul: 1.7 - 5.7 (possibly longer) and 18.7 - 29.7.
Giovanna: 11.6 - 26.6,  16.7 - 24.7.
Andrei: 2.07-25.07
Marc: 11.6 - 19.6, 11.7 - 17.7 (or 13.7 - 17.7)
Mihai: 08.07 - 01.08.
Several people are available for general shifts for a few days; possible times are known.

Monitoring

Sergei coordinating the tests

Here is the test plan for the Monitoring Working Group.

  1. *what is the purpose of the tests, which aspect of the system is
     investigated, what are the objectives (description of the test)?*

The main aim is to test scalability and performance of the IS, OH and Emon components.

  2. *why can these tests only be done on a very large scale, what are
     the challenges and/or expected results?*

Each of those components should be able to serve a couple of thousand clients; therefore, in
order to verify their functioning in these conditions and measure their performance, one
should have a couple of hundred computers to run those clients.

  3. *who will prepare the tests, who will participate in the test
     preparation (be the contact person, participate in meetings),  run
     the tests and make the results available as part of the test report?*

The tests for the IS and OH will be prepared and done by Serguei Kolos. Tests for Emon
will be done by Ingo Scholtes.

  4. *will the tests have been tested and run on a medium scale on one
     of the available farms before?*

Probably not on the medium scale farms, but we will make some standalone tests in order to verify
the functionality of the components. 

  5. *can the tests be automated? Can they run autonomously, i.e. over
     night or during parts of the week-end?*

Yes, all or at least most of the tests can run autonomously.

  6. *do the tests require exclusive access to the farm or can they run
     in 'shared' mode?*

All the tests will require exclusive access to the farm.

  7. *how many nodes and how much testing time do they require a) in
     shared mode b) in exclusive mode?*

The number of nodes is not a crucial issue since we can vary the number of clients per node. Of course a high number of machines is preferable; however, I would say that any number of computers between 300 and 700 is suitable. What is really important is that these computers must be used in exclusive mode for those tests.
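
To illustrate the point, a tiny sketch of how the target client count maps onto clients per node for the node counts mentioned; taking "a couple of thousand" clients (question 2) as 2000 is my assumption.

  # Sketch: clients per node needed to reach a given total number of clients.

  import math

  def clients_per_node(target_clients=2000, nodes=400):
      return math.ceil(target_clients / nodes)

  for nodes in (300, 400, 500, 700):
      print(nodes, "nodes ->", clients_per_node(nodes=nodes), "clients per node")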

  8. *are there any specific requirements on system parameters or any
     constraints?*

Nothing special apart from the general LST requirements.

Requested testing time: 1 day in shared mode, 3 x 1 night for automatic tests in exclusive mode, on 400-500 nodes => around LST phase 4.

Absence of Sergei: 5-18th of June. Ingo will come for 4-5 days starting from 21st of June

Configuration database tests

Igor coordinating the tests

1. what is the purpose of the tests, which aspect of the system is investigated, what are the objectives (description of the test)?
To test the performance of the online configuration system using POOL RAL / Oracle as storage backend instead of XML files.
Two main use cases are considered:
  1. several tens of RDB servers get and put data directly from/to the relational database instead of XML files; such servers are accessed by the TDAQ applications which need to get configuration parameters;
  2. all TDAQ applications (the number of applications is O(1000)) access the Oracle server directly to get configuration parameters.
2. why can these tests only be done on a very large scale, what are the challenges and/or expected results?
 
- the expected number of database clients is around 100 for use case 1 above;
- a larger number can be tried if the Oracle server scales well (such a number can only be obtained from the results of the tests)
- computers should be located in the computing centre to be close (from the network point of view) to the Oracle server
- computers and the Oracle server must be in exclusive usage
 
3. who will prepare the tests, who will participate in the test preparation (be the contact person, participate in meetings),  run the tests and make the results available as part of the test report?
Igor
 
4. will the tests have been tested and run on a medium scale on one of the available farms before?
 
Yes. Normally 100 computers should be enough; a bigger number is desirable if the Oracle server scales well for a big number of clients. But the bigger number is not mandatory, since multiple DB clients can be run on a single computer.
 
5. can the tests be automated? Can they run autonomously, i.e. over night or during parts of the week-end?
For the nights hopefully yes. For the weekend probably not, since Oracle DBA intervention may be required.
 
6. do the tests require exclusive access to the farm or can they run in 'shared' mode?
Exclusive. 
 
7. how many nodes and how much testing time do they require a) in shared mode b) in exclusive mode?
Shared mode can only be used for test preparation. I hope the tests can be prepared in 1 day of shared mode.
Exclusive mode is requested for 100 nodes for 2-3 days. These days will be defined by the IT people.
 
8. are there any specific requirements on system parameters or any constraints?
For the moment there are no such requirements and constraints.

Need an IT Oracle server exclusively; the testing period will depend on its availability. Envisaged: LST phases 2-3.

Setting up during shared mode; exclusive automatic tests can run overnight (to be verified).
Absence: Igor: from 11th of July onwards.

CDI tests

Nuno coordinating the tests

The information for my tests is:

1) PURPOSE: The purpose is to check the performance of the new COOL-based CDI. These checks are intended to see if the new implementation copes with the online community's performance requirements. Some of the tests include publishing data to the IS at regular (measured) intervals of time, checking the limits the CDI can cope with, using the CDI together with multiple IS servers, and using the CDI with multiple clients publishing different (or the same) objects at the same time.
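
A generic rate-probe along the lines described above (a sketch only; the publish() callable is a hypothetical stand-in for the real CDI/IS publish call):

  # Sketch: publish at a fixed interval and record how long each call takes,
  # to estimate the rate at which the CDI stops keeping up.

  import time

  def probe_rate(publish, interval_s, n_publishes=100):
      durations = []
      for i in range(n_publishes):
          t0 = time.monotonic()
          publish(i)
          durations.append(time.monotonic() - t0)
          time.sleep(max(0.0, interval_s - durations[-1]))
      return durations

  # Example with a dummy publisher; a real test would publish to an IS server.
  d = probe_rate(lambda i: None, interval_s=0.01, n_publishes=10)
  print(f"mean publish time: {sum(d) / len(d) * 1000:.3f} ms")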

2) why can these tests only be done on a very large scale, what are the challenges and/or expected results?

The tests are only meaningful when made in an environment as similar as possible to the real usage. In practice the CDI will be used by many clients and many IS servers, which will most likely publish data to be stored at a very high rate. Although it is possible to make these tests on a single machine or just a few machines, the meaning would be completely different, as we would not see the effects of parallel data publishing.

3) who will prepare the tests, who will participate in the test preparation (be the contact person, participate in meetings),  run the tests and make the results available as part of the test report?

All these tasks will be my responsibility.  However, during this week it will not be possible to participate in the meetings as I am away.

4 ) will the tests have been tested and run on a medium scale on one of the available farms before?

Not these tests. The CDI was already tested before, but with different goals (more in the sense of checking whether the underlying Conditions Database was performant enough). The current tests are intended to test the CDI implementation itself and check how much can be improved on the CDI side. Since the COOL API is a project from CERN IT, these tests will also allow checking which improvements the ATLAS experiment can request from the COOL development team.

5) can the tests be automated? Can they run autonomously, i.e. over night or during parts of the week-end?

In principle yes. The tests are quite simple, but if something goes wrong it is not clear whether the environment will recover to start another set of tests. The best would be to run the tests with someone (me) taking care of them. Anyway, the tests can be changed to handle any errors that might occur. The problem is that this will require time that I am not sure to have.

6) do the tests require exclusive access to the farm or can they run in 'shared' mode?

They can be run in shared mode as long as the other processes do not take too many resources; otherwise the time measurements will be biased.

7 ) how many nodes and how much testing time do they require a) in shared mode b) in exclusive mode?

In both cases the tests would not require more than one or two days; one day should be enough. There are not many tests, and in any case the CDI will be implicitly tested in other packages that use it as a mandatory component (e.g. the setup_daq application).

8) are there any specific requirements on system parameters or any constraints?

None, other than access to an Oracle server, preferably exclusive.

HLT-S

Haimo coordinating the tests
Here is the combined list of proposed tests from EF and LVL2.
For each of the tests, I added the 8 questions from the LST page
"HowTo participate in the test" with the answers from the HLT group.

   Q1. what is the purpose of the tests, which aspect of the system is 
       investigated,  what are the objectives (description of the test)?
   Q2. why can these tests only be done on a very large scale, 
       what are the challenges and/or expected results?
   Q3. who will prepare the tests, who will participate in the test preparation 
       (be the contact person, participate in meetings),  run the tests and 
       make the results available as part of the test report?
   Q4. will the tests have been tested and run on a medium scale on one of the 
       available farms before?
   Q5. can the tests be automated? Can they run autonomously, 
       i.e. over night or during parts of the week-end?
   Q6. do the tests require exclusive access to the farm or can they run in
      'shared' mode?
   Q7. how many nodes and how much testing time do they require 
       a) in shared mode b) in exclusive mode?
   Q8. are there any specific requirements on system parameters or any constraints?

I've grouped the tests by whether they need Athena, and combined those that are common
for LVL2 and EF. It's a bit long due to those 8 questions, but I think
they give us a nice view of what we want to do.

- Haimo


Proposed HLT test plan for 2005 Large Scale Tests
=================================================

Name abbreviations:

AdA - Andre dos Anjos
AB  - Andre Bogaerts
HG  - Hegoi Garitaonandia
SS  - Serge Sushkov
HZ  - Haimo Zobernig
AN  - Andrea Negri
SK  - Sander Klous

Tests which don't depend on HLT selection algorithms:
-----------------------------------------------------

Test 1) SW distribution with BitTorrent 
Haimo coordinating the tests

  Test distributing a "shrink-wrapped" SW distribution containing the 
  tdaq-01-01-00, offline 10.0.2 and hlt-02-00-00 releases, combined 
  into a single, pre-tested filesystem. This would be a compressed .iso 
  file of ca. 1.7 GB (5 GB expanded). 

  Q1: The method of distributing this to N nodes which we wish to test is 
  by using the peer-to-peer protocol BitTorrent. We have tested it with
  up to 50 nodes on Fast Ethernet. Maximum time observed was 30 mins to
  distribute 50*1.7 GB = 85 GB on 50 nodes. 
  (see https://uimon.cern.ch/twiki/bin/view/Atlas/HltSoftwareDownload ) 

  Q2: We would like to run this test in all the different sizes of the LST
  to measure its scaling behaviour. The expected scaling
  should be logarithmic with the number of nodes, or better, but it will
  depend on other concurrent network activity (a rough extrapolation is
  sketched after this test's questions).

  Q3: HZ AdA HG
  Q4: yes, already done on up to 50 machines
  Q5: yes, test can be run in automated way, eg by cron job
  Q6: for initial verification shared mode is ok, for scaling measurement
  exclusive access is preferred (largest configuration may need 1-2 hrs
  if log scaling applies, see Q2)

  Q7: the maximum available in each phase of LST (see Q2)
  Testing time initially a few hours if problems appear. BT itself is very
  well tested by the worldwide Internet community...

  Q8: requirements:
  a) a local disk area with at least 5 GB of disk space on every node
  b) ability to mount an .iso file as a read-only file system
  c) BitTorrent version 3.4.2 or higher. If tests are successful we may
     want to use the new "trackerless" version 4.1.1 of BitTorrent, which 
     supports many thousands of simultaneous downloaders of the same file.
     This would be interesting at least for 500 or more nodes in the LST.
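
  As referenced in Q2, here is a rough extrapolation of the distribution
  time, assuming the logarithmic scaling mentioned there and anchored to the
  single measured point (about 30 minutes for 50 nodes on Fast Ethernet);
  an illustration only, not a prediction.

  # Sketch: extrapolate BitTorrent distribution time assuming time ~ log2(nodes),
  # normalised to the measured 50-node / 30-minute point quoted in Q1.

  import math

  MEASURED_NODES, MEASURED_MINUTES = 50, 30

  def estimated_minutes(nodes):
      return MEASURED_MINUTES * math.log2(nodes) / math.log2(MEASURED_NODES)

  for n in (50, 100, 300, 500, 700):
      print(f"{n:4d} nodes: ~{estimated_minutes(n):.0f} min")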

Test 2) verify this file system with known-to-work tdaq-only partition

  Q1, Q2: make sure larger partitions work, as tdaq-01-01-00 may behave 
  differently from tdaq-01-02-00
  Q3: HZ AdA AB  
  Q4: yes  Q5: yes  Q6: can run shared 
  Q7: some large partition should be tried, but not necessarily in all LST sizes
  Q8: no

Test 3) run w/ preloaded events, but dummy algorithms

  Q1, Q2: Learn behaviour of large partitions with preloaded events.
  Run a TDAQ partition w/ dummy algorithms, but with events preloaded
  in ROS (preferred, w/ ROSE as backup alternative)
  This test is also a prerequisite for test 6)
  Q3: HZ AdA AB 
  Q4: yes, done with small partitions in lab32
  Q5: yes, Q6: shared OK, fraction of cluster to be determined 
  Q7: same as Test 2)
  Q8: no

Tests with HLT sw (thus requiring tdaq-01-01-00, Offline 10.0.2, HLT-02-00-00):
-------------------------------------------------------------------------------

Test 4) HelloWorld and Level1 decoding test

  Q1: run algorithms 'HelloWorld' and 'HelloWorldLVL1'
  Both algorithms test much of the Athena infrastructure as used in HLT.
  Memory consumption, disk access patterns and shared library use will
  be studied. HelloWorld applies equally well for LVL2 and EF, while 
  HelloWorldLVL1 decodes the Level 1 result and is known to be working 
  - even in multi-threaded mode - and constitutes a LVL2-specific test

  Q2: verify use of Athena in LST for the first time.
  verify the LVL1 decoding part of the LVL2 trigger in multithreaded mode
  on large partitions. 
  Q3: HZ, AdA, AB, SS, HG, AN
  Q4: yes Q5: yes Q6: shared OK (part of cluster, not sharing same nodes)
  Q7: 50 to 100% of cluster (100% for a final test). Running time 5-30 minutes 
      per test 
  Q8: no

Tests which depend on availability of working algorithms:
---------------------------------------------------------
These tests must use tdaq-01-01-00 as Athena doesn't yet work with the new
tdaq release; all required sw will be distributed via the .iso file downloaded
in Test 1 - we assume this will work as tested in lab32.
The algorithms to be tested here may not be available during the first, smaller
phases of the LST. They will have a chance of getting distributed every time
Test 1 is run. 


Test 5) test HLT configure behaviour

  Q1: if any algorithm is available, that has been demonstrated to at least
  survive the 'configure' transition, we wish to test this transition
  in all the LST phases. This applies to LVL2 as well as EF.

  Q2: understand scaling problems with access to databases, loading shared
  libraries etc. Needs cooperation with DB management group due to large bursty
  load generated by the many DB clients.
  Q3: HZ, AdA, AB, SS, HG, AN
  Q4: yes  Q5: probably needs manual running to observe behaviour in real time
  Q6: shared mode OK 
  Q7: good fraction of cluster, maybe ~50%, probably not 100% (until everything
      is in good shape and under control)
  Q8: DB access may pose restrictions, but not yet known in detail
      Running time could be substantial if test is working at all, due to many
      parameters to be studied (many short runs). Details still need working out.

Test 6) test HLT run phase

  Q1: if any algorithm is available that is known to run on smaller testbeds
  in LVL2 and/or EF, we wish to test it also with all LST sizes,
  (using realistic, non-dummy events). Might be done separately for LVL2 and EF
  i.e. vertical slice probably only in the most optimistic of cases.

  Q2: study the runtime behaviour of algorithms, look out for unexpected/forbidden
  database accesses during event processing. Study resource consumption such as
  memory footprint (watch for leaks), network load, CPU usage etc. with many
  nodes.
  Q3: HZ, AdA, AB, SS, HG, AN
  Q4: yes  Q5: yes, but will also need some handholding...
  Q6: shared mode OK
  Q7: as Test 5), but running time may be substantial if all works in short runs
       (eg. long run on many nodes is a good way to find memory leaks and
        rare but serious problems)
  Q8: as Test 5)

Tests specific to EF requiring no Athena algorithms:
----------------------------------------------------
Andrea and Serge coordinating the tests
These tests should preferably be done with the tdaq-01-02-00 release to 
benefit from recent improvements there. They can be scheduled independently
of the HLT tests requiring Athena.
note: the description of these tests has been moved to the section on sub-system tests.

Preparation tasks for the tests 
  
Software installation and administration: Gokhan and Matei responsible
Database generation tool: Gokhan responsible
Tools: all to contribute
Preparation and documentation of testing tools is ongoing. 
The documentation of those tools will be published on the LST web page; 
a number of documented items can already be found there. 
A twiki page has been created to give easy access for making 
contributions, corrections and for discussion. Stable items will be moved 
into the main LST page.
Areas concerned are:  
  • execution scripts
  • automatic execution scripts
  • analysis tools
  • farm management tools
  • debugging aid
  • How-to descriptions