Guide to TeraGrid co-allocated cross-site runs

From RealityGrid

Introduction.

Welcome to the quick and easy guide to getting a cross-site, co-reserved code running across multiple Grid resources.

In order to use MPIg and HARC to perform a cross-site, co-allocated run on the TeraGrid, you need the following:

  1. An install of the HARC client.
  2. A copy of MPIg (this is pre-installed on some of the machines).
  3. An MPI-enabled code.
  4. Two machines with the same architecture.

Building your code.

The MPIg website provides .soft files for the sites where MPIg is installed. Download the ones for the sites you plan to use. If cross-site runs fail with code built using the Intel compilers, it is worth trying the GCC versions, as so far they've been more reliable. After putting the .soft files in your home directory, make sure to run "resoft" so that your shell picks them up.

If your code has been developed with another compiler, be aware that GCC can break some codes (like LAMMPS). If you get "incorrect" results on test runs, you probably need to lower the optimisation level you are compiling at (so try -O2 instead of -O3). All the libraries you use should be compiled with the same compiler, so if your code relies on FFTW, for example, make sure you have access to a build of FFTW done with GCC. If your code is a third-party code (i.e. it was developed by someone else) and it has problems with GCC, the documentation will likely say so and give appropriate work-arounds (in the case of LAMMPS, a couple of files need to be compiled with -fno-strict-aliasing).

Constructing your RSL files.

The MPIg site has a number of example RSL files. When submitting a cross-site run, you submit each job as a separate sub-job in a single RSL file to mpiexec (which wraps around globusrun). Each sub-job is numbered. Below is a sample RSL file:

+
( &(resourceManagerContact="grid-hg.ncsa.teragrid.org/jobmanager-pbs")
   (reservation_id="owain.1305")
   (host_types=ia64-compute)
   (host_xcount=4)
   (xcount=2)
   (count=8)
   (project="TG-DMR070054N")
   (jobtype=mpi)
   (maxwalltime=20)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
   (directory="/home/ac/owain/Experiments/routine_test")
   (executable="/home/ac/owain/src/mpig_lammps/src/lmp_ncsa_MPIg_gcc")
   (arguments="-in" "/home/ac/owain/Experiments/routine_test/harc.xs.in" "-partition" "2x8")
)
( &(resourceManagerContact="tg-login.sdsc.teragrid.org/jobmanager-pbs_gcc_resid")
   (reservation_id="1191590147")
   (host_types=ia64-compute)
   (host_xcount=4)
   (xcount=2)
   (count=8)
   (project="TG-DMR070054N")
   (jobtype=mpi)
   (maxwalltime=20)
   (label="subjob 1")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
                (LD_LIBRARY_PATH "/usr/local/apps/globus-mpi-4.0.1-mpich-gm-gcc322-pthr-r1/lib/"))
   (directory="/users/okenway/Experiments/routine_test")
   (executable="/users/okenway/src/mpig_lammps/src/lmp_sdsc_MPIg_gcc")
   (arguments="-in" "/users/okenway/Experiments/routine_test/harc.xs.in" "-partition" "2x8")
)

This RSL file runs two jobs, one at NCSA and one at SDSC. Most of the entries should be familiar. The important things to notice (and remember) about this file are:

  1. host_xcount is the number of nodes, xcount is the number of processors per node and count is the total number of processors, so count should equal host_xcount multiplied by xcount (4 x 2 = 8 in the example).
  2. Each sub-job needs to be numbered incrementally starting at zero, and both label and GLOBUS_DUROC_SUBJOB_INDEX need to be set correctly.
  3. The LD_LIBRARY_PATH line in the SDSC sub-job is there because the back-end and front-end have different environments, so you have to tell jobs manually where to find the Globus libraries. This problem is not unique to SDSC; other sites have it as well. If your code doesn't start and you get a bunch of errors from ld telling you that it cannot find certain libraries, then do the following:
    1. Change directory to the place where your binary is.
    2. Type "ldd my_binary" where "my_binary" is the name of your binary. This will spew out a few pages of information telling you where it can find all the libraries your binary is linked to.
    3. Scroll through this (or use grep) until you find the ones ld is complaining about.
    4. Add the directory those libraries are in to the LD_LIBRARY_PATH in the RSL file in the same way it is added in the example - make sure your brackets are correct - it's *inside* the "environment" entry.
    Depending on the machine and libraries you are using, you may need to add multiple directories to the LD_LIBRARY_PATH.
  4. The appropriate values for reservation_id will be provided to you by HARC when you make a reservation.
  5. It is vitally important that you enclose the reservation ID in quotes, otherwise your job may not run.
  6. The arguments line contains arguments for the binary of the job (in this instance -partition is a command-line option for LAMMPS and not necessary for all cross-site runs - just this particular problem).
  7. The resource manager used for the SDSC sub-job (jobmanager-pbs_gcc_resid in the example) is different from the one you normally use at SDSC.
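For the library problem in note 3, the ldd steps can be sketched in Python. This is only an illustrative helper, not part of MPIg or Globus: the function names library_dirs and ld_library_path are hypothetical, and it assumes ldd prints lines of the usual "name => /path/to/name (address)" form. You pass it the text that "ldd my_binary" printed and the names of the libraries ld complained about.

```python
import os

def library_dirs(ldd_output, wanted):
    """Map each wanted library name to the directory ldd resolved it to."""
    dirs = {}
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue
        name, _, target = line.partition("=>")
        name = name.strip()
        parts = target.split()
        # Skip unresolved entries ("not found") and entries without a path.
        if name in wanted and parts and parts[0].startswith("/"):
            dirs[name] = os.path.dirname(parts[0])
    return dirs

def ld_library_path(dirs):
    """Join the unique directories, in first-seen order, with ':'."""
    unique = list(dict.fromkeys(dirs.values()))
    return ":".join(unique)
```

The string returned by ld_library_path is what would go inside the LD_LIBRARY_PATH entry of the "environment" line. Libraries that ldd itself cannot resolve are left out, so compare the result against the list you asked for.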

This RSL file may be run by typing "mpiexec --globusrsl=my.rsl" where "my.rsl" is the name of the RSL file. You can also run it with globusrun. Note that batch submission of multi-job RSL files is not currently supported by Globus, so you must do this interactively. The job will not actually start until the reservation starts.
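Since submission is interactive and tied to a reservation, it may be worth checking the RSL file against the notes above before submitting it. The following is a minimal sketch of such a check, not a tool that ships with MPIg or Globus. The function name check_rsl is hypothetical. It assumes every sub-job begins with "( &" and that labels follow the "subjob N" form used in the sample. It uses regular expressions rather than a real RSL parser, so treat a clean result as a hint, not a guarantee.

```python
import re

def check_rsl(text):
    """Return a list of human-readable problems found in an RSL string."""
    problems = []
    # Sub-jobs each begin with "( &"; drop whatever precedes the first one.
    subjobs = re.split(r"\(\s*&", text)[1:]
    for index, body in enumerate(subjobs):
        def number(key):
            # Matches "(key=123)" but not "(host_key=123)".
            match = re.search(r"\(\s*" + key + r"\s*=\s*(\d+)\s*\)", body)
            return int(match.group(1)) if match else None

        nodes = number("host_xcount")
        per_node = number("xcount")
        total = number("count")
        if None in (nodes, per_node, total):
            problems.append(f"subjob {index}: host_xcount, xcount or count missing")
        elif nodes * per_node != total:
            problems.append(f"subjob {index}: count {total} != {nodes} x {per_node}")

        # Assumption: labels are written as "subjob N", as in the sample file.
        if not re.search(r'\(\s*label\s*=\s*"subjob ' + str(index) + r'"\s*\)', body):
            problems.append(f'subjob {index}: label is not "subjob {index}"')
        if not re.search(r"GLOBUS_DUROC_SUBJOB_INDEX\s+" + str(index) + r"\s*\)", body):
            problems.append(f"subjob {index}: GLOBUS_DUROC_SUBJOB_INDEX is not {index}")
        if not re.search(r'\(\s*reservation_id\s*=\s*"[^"]+"\s*\)', body):
            problems.append(f"subjob {index}: reservation_id missing or unquoted")
    return problems
```

Calling check_rsl(open("my.rsl").read()) returns an empty list when none of these checks fail.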

Making reservations with HARC.

Once HARC is configured correctly, making reservations is relatively straightforward. A typical reservation is of the form:

harc-reserve -c grid-hg.ncsa.teragrid.org/32 -c tg-login2.sdsc.edu/32 -s 16:15 -d 1:00

This reserves 32 processors at NCSA and 32 processors at SDSC starting at 16:15 local time for a duration of one hour. If this is successful, HARC will return a pair of reservation IDs for you to insert into your RSL. If you want to watch what it's doing, add -debug to that command.
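If you make reservations often, it can help to build the command from a list of sites instead of retyping it. The sketch below only assembles the command line shown above; it uses no harc-reserve options other than -c, -s and -d, which are taken from that example. The function name harc_reserve_command is hypothetical.

```python
import shlex

def harc_reserve_command(resources, start, duration):
    """Build a harc-reserve argument list.

    resources: list of (host, processors) pairs, one per site
    start, duration: strings passed unchanged to -s and -d
    """
    args = ["harc-reserve"]
    for host, processors in resources:
        args += ["-c", f"{host}/{processors}"]
    args += ["-s", start, "-d", duration]
    return args

command = harc_reserve_command(
    [("grid-hg.ncsa.teragrid.org", 32), ("tg-login2.sdsc.edu", 32)],
    start="16:15",
    duration="1:00",
)
# Print the command for review; shlex.join quotes arguments where needed.
print(shlex.join(command))
```

The list form can be handed to subprocess.run directly once you are happy with the printed command.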

Some things to note:

  1. Make a reservation that's somewhat longer (at least five minutes) than the run-time you have requested in your RSL file. This gives the scheduler some time to cycle and start your job. If the reservation is no longer than the requested run-time, your job will never start, for no apparent reason.
  2. Make your reservations well in advance, as some batch systems will otherwise not start your jobs in a timely fashion. One hour ahead is a safe margin, although depending on the site this could be less. In practice, the busy nature of some sites means you will probably have to reserve well in advance anyway.
  3. Even though you've specified a start time, it's likely that some sites will take longer than others to start your code. Don't panic. If any site hasn't started it within five minutes of the start time, it's possible that something is wrong.
  4. If you want to cancel a reservation, use the harc-cancel command. You can check the status of your reservation with harc-status. Note that harc-status only tells you the status of your reservation according to HARC. The batch system on the machine you are using may not agree with this if there is a problem.
  5. The showres command at NCSA and show_res command at SDSC show the status of reservations on those machines.
  6. If you try to cancel a reservation with harc-cancel and HARC claims it doesn't know anything about the reservation, then it's likely that the RM at that site has been restarted. You can cancel the reservations with local commands at those sites. At SDSC, the command to cancel a reservation is user_cancel_res and at NCSA the command is releaseres.
  7. If the HARC client crashes, or you cancel it with Ctrl-C, this does not prevent a reservation being made, just the reservation IDs from being returned to you. You can find out the reservation IDs using the showres/show_res commands above.
  8. If a cross-site reservation fails (due to a lack of resources), you might still get a confirmation e-mail from SDSC for a reservation that doesn't exist. This is because HARC attempts to make the reservations at both sites and then cancels them when one fails. In the time this takes, the scheduler at SDSC can notice that you have an upcoming reservation and e-mail you about it.

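The timing advice in notes 1 and 2 comes down to simple arithmetic, sketched below. The function names are hypothetical. The sketch assumes that maxwalltime in the RSL file is in minutes and that the -d option takes H:MM, which is inferred from the "-d 1:00" example for one hour. The five-minute margin and one-hour lead time are the values suggested above, not limits enforced by HARC.

```python
from datetime import datetime, timedelta

def reservation_duration(maxwalltime_minutes, margin_minutes=5):
    """Return an H:MM duration covering the walltime plus a margin."""
    total = maxwalltime_minutes + margin_minutes
    return f"{total // 60}:{total % 60:02d}"

def enough_lead_time(now, start, lead=timedelta(hours=1)):
    """True if the reservation start is at least `lead` after `now`."""
    return start - now >= lead
```

For the sample RSL file, which asks for maxwalltime=20, reservation_duration(20) gives "0:25" as the shortest duration worth requesting.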
Testing your code.

Testing is vitally important in this environment. There are any number of things that can go wrong and because of the complexity of what we are doing, there are many points of failure.

A typical testing regime is as follows:

  1. Test each site with a single sub-job. This lets you know if you've built the code correctly and if it's basically working at each site. This is the stage at which you will pick up library problems like the ones mentioned above. For obvious reasons, you don't need to use reservations to do this.
  2. Run a job containing two sub-jobs at a single site (do this at each site). This makes sure MPIg is working properly at each site.
  3. If both of the above work, then attempt a full cross-site run. If this fails, the cause could be anything from network problems to incompatible architectures.

Some things to note:

  1. At some sites, you won't get any output at all until the job has terminated. If the job fails, you may not get any output at all.
  2. Make sure to check the output at each stage to make sure it's sensible.
  3. Sometimes if there's a problem, your jobs won't terminate properly. You can use the PBS commands qstat and qdel to check the status of jobs and terminate them respectively. At NCSA, the checkjob command provides invaluable output in case of problems.
  4. If you don't get any output and you have a problem, it's possible the answer is somewhere in the many megabytes of output that GRAM writes to log files in your home directory. Note that the ID number in the GRAM log file name does not match the job ID, so it's better to look at time-stamps.
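
Picking out GRAM logs by time-stamp can be sketched as follows. The function name logs_modified_between is hypothetical, and the default file name pattern gram_job_mgr_*.log is an assumption; change it to match the log names you actually see in your home directory. Times are given as seconds since the epoch, as returned by time.time().

```python
import glob
import os

def logs_modified_between(directory, start, end, pattern="gram_job_mgr_*.log"):
    """Return log paths last modified within [start, end], newest first."""
    matches = []
    for path in glob.glob(os.path.join(directory, pattern)):
        mtime = os.path.getmtime(path)
        if start <= mtime <= end:
            matches.append((mtime, path))
    # Sort on modification time so the most recent log comes first.
    return [path for mtime, path in sorted(matches, reverse=True)]
```

For example, logs_modified_between(os.path.expanduser("~"), t0, t1) with t0 and t1 bracketing the reservation window narrows the search to the logs written while the job should have been running.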