GENIUS Troubleshooting
From RealityGrid
HARC
Setting up HARC in CCS
HARC is installed in /usr/local/cct-harc-1.9.3. To use HARC, export the following variables (or add them to your .bash_profile):
export COG_INSTALL_PATH=/usr/local/cog-jglobus-1.4/
export HARC_INSTALL_PATH=/usr/local/cct-harc-1.9.3
export PATH=$HARC_INSTALL_PATH/bin:$PATH
If you have never used the CoG Kit before, run its setup wizard:
$COG_INSTALL_PATH/bin/setup
Finally, create the file ~/etc/harc.properties
Put the project IDs (TG budget codes) and their respective machine names into this file. See the documentation in $HARC_INSTALL_PATH/harc-reserve.txt for details of what the file must contain; it also explains how to make a reservation.
Consult the HARC documentation, available in $HARC_INSTALL_PATH, before using HARC.
Before a reservation can be made, generate a proxy certificate with grid-proxy-init. To make a cross-site reservation on the UK NGS, do the following:
harc-reserve -c ngs.oerc.ox.ac.uk/2 -c vidar.ngs.manchester.ac.uk/2 -s 11:50 -d 0:10
This requests a reservation that starts at 11:50am, lasts for 10 minutes, and asks for 2 processors on each of the Oxford and Manchester machines. If the reservation is successful, the reservation IDs are displayed (in this example each site granted 4 CPUs rather than the 2 requested):
ngs.oerc.ox.ac.uk/R16645.ngs - got 4 CPUs, not 2
vidar.ngs.manchester.ac.uk/R16981.vidar - got 4 CPUs, not 2
When launching the corresponding job(s) into the reservation, the reservation ID must be added to the RSL file in a reservation_id element (see sample RSL files below).
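The reservation ID needed for the RSL is the part after the "/" in each line of harc-reserve output. A minimal sketch of pulling it out in a script, assuming the output always has the "host/ID - got N CPUs" shape shown above (extract_reservation_ids is a hypothetical helper, not part of HARC):

```shell
#!/bin/sh
# Sample harc-reserve output, copied from the example above.
output='ngs.oerc.ox.ac.uk/R16645.ngs - got 4 CPUs, not 2
vidar.ngs.manchester.ac.uk/R16981.vidar - got 4 CPUs, not 2'

# extract_reservation_ids (hypothetical helper): read harc-reserve output on
# stdin and print "host reservation_id" for every line whose first field
# contains a "/". Assumes the output format shown in this document.
extract_reservation_ids() {
    awk '$1 ~ /\// { split($1, parts, "/"); print parts[1], parts[2] }'
}

printf '%s\n' "$output" | extract_reservation_ids
```

In real use the function would be fed from the harc-reserve command itself rather than from a saved sample.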
To check the reservations:
harc-status -c ngs.oerc.ox.ac.uk/R16645.ngs -c vidar.ngs.manchester.ac.uk/R16981.vidar
To cancel the reservations:
harc-cancel -c ngs.oerc.ox.ac.uk/R16645.ngs -c vidar.ngs.manchester.ac.uk/R16981.vidar
HARC Troubleshooting
When making reservations with HARC, bear the following in mind:
- When making reservations, ensure that they are slightly longer (~5 mins) than the runtime of the job, to allow time for an iteration of the scheduler to start the job in the reservation.
- Ensure you specify the job runtime in the RSL with maxWallTime.
- Ensure the number of CPUs reserved tallies with the number of CPUs requested in the RSL (for example in count), especially if you are specifying a number of nodes in the RSL.
- Once a reservation has been made you can submit your job. It will be held in the queue until the reservation starts.
- If you see an entry similar to "ngs.leeds.ac.uk: /usr/bin/sudo: no passwd entry for -!", you are probably not in the gridmap file on that machine. Get in touch with the machine admins.
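The padding advice above can be turned into a small calculation. The sketch below assumes that maxWallTime is given in minutes and that the harc-reserve -d option takes H:MM, as in the "-d 0:10" example for a 10 minute reservation; reservation_duration is a hypothetical helper name.

```shell
#!/bin/sh
# reservation_duration (hypothetical helper): given the job runtime in
# minutes (the RSL maxWallTime) and an optional padding in minutes
# (default 5), print the padded reservation length as H:MM.
reservation_duration() {
    walltime_min=$1
    padding_min=${2:-5}
    total=$((walltime_min + padding_min))
    printf '%d:%02d\n' $((total / 60)) $((total % 60))
}

reservation_duration 5      # prints 0:10
reservation_duration 115    # prints 2:00
```

The result can be passed straight to -d, so that the reservation length and maxWallTime are derived from one number instead of being typed twice.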
GENIUS Site Troubleshooting
NGS
Single-site jobs can be launched into HARC reservations with RSL similar to the following:
&( executable = "/bin/sleep" )
 ( arguments = "60" )
 ( stderr = "stderr.txt" )
 ( stdout = "stdout.txt" )
 ( environment = ( "NGSMODULES" "gm/2.0.8" ) )
 ( count = "2" )
 ( jobType = "mpi" )
 ( maxWallTime = "5" )
 ( reservation_id = "R16981.vidar")
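Before submitting, it can help to read count, maxWallTime and reservation_id back out of the RSL and compare them with what was reserved. The following is only a sketch: rsl_value is a hypothetical helper that handles simple one-attribute-per-line RSL like the sample above, not the full RSL grammar, and reserved_cpus is a value you would fill in from the harc-reserve output.

```shell
#!/bin/sh
# A cut-down copy of the sample RSL above, one attribute per line.
rsl='&( executable = "/bin/sleep" )
 ( count = "2" )
 ( jobType = "mpi" )
 ( maxWallTime = "5" )
 ( reservation_id = "R16981.vidar")'

# rsl_value (hypothetical helper): print the value of a simple RSL
# attribute read from stdin. Only handles "(name = value)" pairs whose
# value contains no spaces; quotes around the value are optional.
rsl_value() {
    attr=$1
    sed -n "s/.*([[:space:]]*${attr}[[:space:]]*=[[:space:]]*\"\{0,1\}\([^\") ]*\).*/\1/p" | head -n 1
}

count=$(printf '%s\n' "$rsl" | rsl_value count)
walltime=$(printf '%s\n' "$rsl" | rsl_value maxWallTime)
echo "count=$count maxWallTime=$walltime"

# Assumed value: the number of CPUs you expect the reservation to hold.
reserved_cpus=2
if [ "$count" -eq "$reserved_cpus" ]; then
    echo "count matches the reservation"
else
    echo "count ($count) differs from reserved CPUs ($reserved_cpus)" >&2
fi
```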
To monitor your job and check that it has been submitted to the reservation, use the following commands on the NGS site:
module load pbs
qstat | grep <username>
qstat -f job_id
It is also possible to submit jobs into HARC reservations via PBS. PBSPro (on the NGS machines) handles reservations by creating a special queue for the reservation. You can submit jobs to your reservations in the same way you would submit to any specific queue by adding a line to your submission script. Supposing your reservation is at Manchester and the reservation ID is R6513.vidar:
#!/bin/sh
#
#PBS -q R6513
#PBS -l nodes=1:ppn=2
#PBS -j oe
#PBS -N example
/bin/date
If you wish to avoid constantly editing the submission script, then you can specify the queue (and incidentally any other property) on the qsub command-line and it will override what is in the file:
qsub -q R6513 example.sh
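In the example above the queue name (R6513) is the reservation ID (R6513.vidar) without its site suffix. Assuming that pattern holds for your reservation, which is worth confirming with qstat -Q, the queue name can be derived in a script; queue_for_reservation is a hypothetical helper name:

```shell
#!/bin/sh
# queue_for_reservation (hypothetical helper): strip everything from the
# first "." onwards, e.g. R6513.vidar -> R6513. Assumes the queue is named
# after the reservation ID as in the example above.
queue_for_reservation() {
    printf '%s\n' "${1%%.*}"
}

queue=$(queue_for_reservation "R6513.vidar")
# Print the command rather than running it, so the sketch works without PBS.
echo "qsub -q $queue example.sh"
```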
Take care that the combination of nodes and processors per node you request actually exists in the reservation queue, as some nodes have four processors and some have two.
If you want to check the status of your reservation (if harc-status is not working), then you can use the qstat -Q and qstat -q commands.
TeraGrid
For a full guide to performing co-allocated cross-site runs on the TeraGrid, see Guide to TeraGrid co-allocated cross-site runs.
NCSA
To monitor your job and check that it has been submitted to the reservation, use the following commands:
qstat -f job_id
checkjob job_id
SDSC
LONI
Before submitting to LONI machines, make sure that you have the LONI CA certificate installed in your Globus grid-security/certificates folder.
Also ensure that you are mapped into a project at https://allocations.loni.org
To check that your job has gone into the reservation:
llq -l l3f1n03.262.0 | grep Reservation
When submitting jobs to LONI, ensure that the environment variable GBLL_NETWORK_MPI is set to sn_single,not_shared,US,HIGH
For example:
&( executable = "/bin/date" )
 ( stderr = "stderr.txt" )
 ( stdout = "stdout.txt" )
 ( environment = ( "GBLL_NETWORK_MPI" "sn_single,not_shared,US,HIGH" ) )
 ( count = "2" )
 ( jobType = "mpi" )
 ( maxWallTime = "5" )
 ( reservation_id = "l2f1n01.238.r")
For a cross-site run, RSL similar to the following can be used:
+
( & (resourceManagerContact="zeke.loni.org/jobmanager-loadleveler")
    (reservation_id = "l3f1n03.70.r")
    (project = loni_loni_ahm07)
    (job_type = multiple)
    (count= 32)
    (host_count=4)
    (maxWallTime = 20)
    (directory = /work/nfs301/sjzasada/hemelb-runs/)
    (environment= (GLOBUS_DUROC_SUBJOB_INDEX 0)
                  (GBLL_NETWORK_MPI sn_single,not_shared,US,HIGH)
                  (LD_LIBRARY_PATH /usr/local/globus/globus.4.0.4/lib/:/usr/local/packages/mpig-64/lib)
                  (PATH /usr/local/globus/globus-4.0.4/bin/:/usr/local/packages/mpig-64/bin:.) )
    (executable = /work/nfs301/mazzeo/hemelb/Code/hemelb.exe)
    (arguments = /work/nfs301/sjzasada/hemelb-runs/angio1_input.asc)
    (stderr=/work/nfs301/sjzasada/hemelb-runs/angio1_32_DUCKY_32_ZEKE_MPIg.err)
    (stdout=/work/nfs301/sjzasada/hemelb-runs/angio1_32_DUCKY_32_ZEKE_MPIg.out)
)
( & (resourceManagerContact="ducky.loni.org/jobmanager-loadleveler")
    (project = loni_loni_ahm07 )
    (reservation_id = "l2f1n01.247.r")
    (count= 32)
    (host_count=4)
    (maxWallTime = 20)
    (directory = /work/nfs201/sjzasada/hemelb-runs/)
    (environment= (GLOBUS_DUROC_SUBJOB_INDEX 1)
                  (GBLL_NETWORK_MPI sn_single,not_shared,US,HIGH)
                  (LD_LIBRARY_PATH /usr/local/globus-4.0.4/lib/:/usr/local/packages/mpig-64/lib)
                  (PATH /usr/local/globus-4.0.4/bin/:/usr/local/packages/mpig-64/bin:.) )
    (executable = /work/nfs201/mazzeo/hemelb/Code/hemelb.exe)
    (arguments = /work/nfs201/sjzasada/hemelb-runs/angio1_input.asc)
    (stderr=/work/nfs201/sjzasada/hemelb-runs/angio1_32_DUCKY_32_ZEKE_MPIg.err)
    (stdout=/work/nfs201/sjzasada/hemelb-runs/angio1_32_DUCKY_32_ZEKE_MPIg.out)
)
For more details on cross-site runs on the LONI resources, see https://docs.loni.org/wiki/Running_a_Job_Using_Multiple_Clusters_with_Globus_and_mpiG
