GENIUS Troubleshooting

From RealityGrid

HARC

Setting up HARC in CCS

HARC is installed in /usr/local/cct-harc-1.9.3. To use HARC, export the following variables (or add them to your .bash_profile):

export COG_INSTALL_PATH=/usr/local/cog-jglobus-1.4/
export HARC_INSTALL_PATH=/usr/local/cct-harc-1.9.3
export PATH=$HARC_INSTALL_PATH/bin:$PATH

If you have never used the CoG Kit before, run the setup wizard:

$COG_INSTALL_PATH/bin/setup

Finally, create the file ~/etc/harc.properties and put into it the project IDs (TG budget codes) and their respective machine names. See the documentation in $HARC_INSTALL_PATH/harc-reserve.txt for details of what this file must contain. That document also explains how to make a reservation.

Consult the HARC documentation, available in $HARC_INSTALL_PATH, before using HARC.

Before a reservation can be made, generate a proxy certificate with grid-proxy-init. To make a cross-site reservation on the UK NGS, do the following:

harc-reserve -c ngs.oerc.ox.ac.uk/2 -c vidar.ngs.manchester.ac.uk/2 -s 11:50 -d 0:10

The reservation starts at 11:50am and lasts for 10 minutes. The command requests 2 processors on each of the Oxford and Manchester machines; in this example each machine reserved 4 CPUs instead, as the output reports. If the reservation is successful, the reservation IDs will be displayed:

ngs.oerc.ox.ac.uk/R16645.ngs - got 4 CPUs, not 2
vidar.ngs.manchester.ac.uk/R16981.vidar - got 4 CPUs, not 2
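The reservation IDs in this output are needed again as -c arguments when checking or cancelling the reservations, so it can be convenient to extract them rather than retype them. The following is a minimal sketch and not part of HARC: the function names reservation_ids and harc_c_args are hypothetical, and it assumes the output format shown above, in which each result line begins with a host/ID token.

```shell
# Hypothetical helpers, not part of HARC.
# Assumption: each successful result line starts with "<host>/<reservation-id>".

# Read harc-reserve output on stdin and print one host/ID token per line.
reservation_ids() {
    awk 'index($1, "/") > 0 { print $1 }'
}

# Read harc-reserve output on stdin and print "-c <id> -c <id> ..." on one line.
harc_c_args() {
    reservation_ids | sed 's/^/-c /' | paste -sd' ' -
}
```

If the output of harc-reserve has been saved to a file (here called reservation.out, an illustrative name), the result can be passed on with something like harc-status $(harc_c_args < reservation.out).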

When launching the corresponding job(s) into the reservation, the reservation ID must be added to the RSL file in a reservation_id element (see sample RSL files below).

To check the reservations:

harc-status -c ngs.oerc.ox.ac.uk/R16645.ngs -c vidar.ngs.manchester.ac.uk/R16981.vidar

To cancel the reservations:

harc-cancel -c ngs.oerc.ox.ac.uk/R16645.ngs -c vidar.ngs.manchester.ac.uk/R16981.vidar
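Since a proxy certificate is required before a reservation can be made, a quick check of the remaining proxy lifetime can save a failed attempt. The sketch below is one way to do this under stated assumptions: proxy_outlives is a hypothetical helper name, the lifetime is taken from grid-proxy-info -timeleft (which reports seconds remaining), and the 15-minute threshold in the usage line is only an example.

```shell
# Hypothetical helper, not part of HARC or Globus.
# $1: seconds of proxy lifetime left (empty or missing is treated as no proxy)
# $2: minutes the proxy should remain valid
# Succeeds only if the proxy outlives the requested number of minutes.
proxy_outlives() {
    seconds_left=${1:-0}
    needed_min=$2
    [ "$seconds_left" -gt $(( needed_min * 60 )) ]
}

# Usage (assumes grid-proxy-info and grid-proxy-init are on PATH):
#   proxy_outlives "$(grid-proxy-info -timeleft 2>/dev/null)" 15 || grid-proxy-init
```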

HARC Troubleshooting

When making reservations with HARC, bear the following in mind:

  • Make each reservation slightly longer (~5 mins) than the runtime of the job, to allow time for an iteration of the scheduler to start the job in the reservation.
  • Ensure you specify the job runtime in the RSL with maxWallTime.
  • Ensure the number of CPUs reserved tallies with the number of CPUs requested in the RSL (in count or the equivalent attribute), especially if you are specifying a number of nodes in the RSL.
  • Once a reservation has been made you can submit your job. It will be held in the queue until the reservation starts.
  • If you see an entry similar to: ngs.leeds.ac.uk: /usr/bin/sudo: no passwd entry for -! then you may not be in the gridmap file on that machine. Get in touch with the machine administrators.
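The first two points above amount to a small calculation: the reservation duration is the job's maxWallTime plus a margin of about 5 minutes. A minimal sketch of that arithmetic follows. The function name reservation_duration is hypothetical, maxWallTime is taken to be in minutes, and the H:MM output format is an assumption based on the -d 0:10 example earlier on this page.

```shell
# Hypothetical helper, not part of HARC.
# $1: job walltime in minutes (the value given as maxWallTime in the RSL)
# $2: optional margin in minutes for the scheduler to start the job (default 5)
# Prints the reservation duration as H:MM, the form used with harc-reserve -d above.
reservation_duration() {
    walltime_min=$1
    margin_min=${2:-5}
    total=$(( walltime_min + margin_min ))
    printf '%d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}
```

For example, reservation_duration 5 prints 0:10, which matches a 5-minute job run inside a 10-minute reservation.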


GENIUS Site Troubleshooting

NGS

Single-site jobs can be launched into HARC reservations with RSL similar to the following:

&( executable = "/bin/sleep" )
 ( arguments = "60" )
 ( stderr = "stderr.txt" )
 ( stdout = "stdout.txt" )
 ( environment = ( "NGSMODULES" "gm/2.0.8" ) )
 ( count = "2" )
 ( jobType = "mpi" ) 
 ( maxWallTime = "5" )
 ( reservation_id = "R16981.vidar")
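Note that the RSL carries only the ID part of what harc-reserve printed (R16981.vidar, not vidar.ngs.manchester.ac.uk/R16981.vidar), and that count and maxWallTime have to agree with the reservation. The sketch below generates the RSL above from those three values so they are set in one place. The function name make_reservation_rsl is hypothetical; the executable, arguments and environment entries are simply copied from the sample.

```shell
# Hypothetical helper that prints RSL like the sample above.
# $1: reservation ID, with or without the leading "host/" part
# $2: CPU count (should tally with the CPUs reserved)
# $3: maxWallTime in minutes
make_reservation_rsl() {
    rid=${1##*/}    # strip everything up to the last "/", if present
    cat <<EOF
&( executable = "/bin/sleep" )
 ( arguments = "60" )
 ( stderr = "stderr.txt" )
 ( stdout = "stdout.txt" )
 ( environment = ( "NGSMODULES" "gm/2.0.8" ) )
 ( count = "$2" )
 ( jobType = "mpi" )
 ( maxWallTime = "$3" )
 ( reservation_id = "$rid" )
EOF
}

# Usage: make_reservation_rsl vidar.ngs.manchester.ac.uk/R16981.vidar 2 5 > job.rsl
```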

To monitor your job and check that it has been submitted to the reservation, use the following commands on the NGS site:

module load pbs
qstat | grep <username>
qstat -f job_id

It is also possible to submit jobs into HARC reservations via PBS. PBSPro (on the NGS machines) handles reservations by creating a special queue for the reservation. You can submit jobs to your reservations in the same way you would submit to any specific queue by adding a line to your submission script. Supposing your reservation is at Manchester and the reservation ID is R6513.vidar:

#!/bin/sh
#
#PBS -q R6513
#PBS -l nodes=1:ppn=2
#PBS -j oe
#PBS -N example

/bin/date

If you wish to avoid constantly editing the submission script, then you can specify the queue (and incidentally any other property) on the qsub command-line and it will override what is in the file:

qsub -q R6513 example.sh
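In the example above the queue name (R6513) is the reservation ID (R6513.vidar) with everything from the first dot removed. A minimal sketch of that mapping follows; reservation_queue is a hypothetical helper name, and the rule is inferred only from this one example.

```shell
# Hypothetical helper, not part of PBS or HARC.
# Maps a reservation ID such as "R6513.vidar" (optionally prefixed with
# "host/") to the queue name used in the example above, "R6513".
reservation_queue() {
    id=${1##*/}                  # drop a leading "host/" part, if present
    printf '%s\n' "${id%%.*}"    # keep the text before the first dot
}

# Usage: qsub -q "$(reservation_queue R6513.vidar)" example.sh
```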

Take care when specifying combinations of nodes and processors per node: the combination must exist in the reservation queue, because some nodes have four processors and some have two.

If you want to check the status of your reservation (if harc-status is not working), then you can use the qstat -Q and qstat -q commands.

TeraGrid

For a full guide to performing co-allocated cross-site runs on the TeraGrid, see Guide to TeraGrid co-allocated cross-site runs.

NCSA

To monitor your job and check that it is submitted to the reservation use the following commands:

qstat -f job_id
checkjob job_id

SDSC

LONI

Before submitting to LONI machines make sure that you have the LONI CA certificate installed in your Globus grid-security/certificates folder.

Also, ensure that you are mapped to a project at https://allocations.loni.org

To check that your job has gone into the reservation:

llq -l l3f1n03.262.0 | grep Reservation

When submitting jobs to LONI, ensure that the environment variable GBLL_NETWORK_MPI is set to sn_single,not_shared,US,HIGH

For example:

&( executable = "/bin/date" )
 ( stderr = "stderr.txt" )
 ( stdout = "stdout.txt" )
 ( environment = ( "GBLL_NETWORK_MPI" "sn_single,not_shared,US,HIGH" ) )
 ( count = "2" ) 
 ( jobType = "mpi" ) 
 ( maxWallTime = "5" )
 ( reservation_id = "l2f1n01.238.r")

To perform a cross-site run, RSL similar to the following can be used:

+
( &
(resourceManagerContact="zeke.loni.org/jobmanager-loadleveler")
(reservation_id = "l3f1n03.70.r")
(project = loni_loni_ahm07)
(job_type = multiple)
(count= 32)
(host_count=4)
(maxWallTime = 20)
(directory = /work/nfs301/sjzasada/hemelb-runs/)
(environment=
(GLOBUS_DUROC_SUBJOB_INDEX 0)
(GBLL_NETWORK_MPI sn_single,not_shared,US,HIGH)
(LD_LIBRARY_PATH
/usr/local/globus/globus.4.0.4/lib/:/usr/local/packages/mpig-64/lib)
(PATH
/usr/local/globus/globus-4.0.4/bin/:/usr/local/packages/mpig-64/bin:.)
)
(executable = /work/nfs301/mazzeo/hemelb/Code/hemelb.exe)
(arguments = /work/nfs301/sjzasada/hemelb-runs/angio1_input.asc)
(stderr=/work/nfs301/sjzasada/hemelb-runs/angio1_32_DUCKY_32_ZEKE_MPIg.err)
(stdout=/work/nfs301/sjzasada/hemelb-runs/angio1_32_DUCKY_32_ZEKE_MPIg.out)
)
(
&
(resourceManagerContact="ducky.loni.org/jobmanager-loadleveler")
(project = loni_loni_ahm07 )
(reservation_id = "l2f1n01.247.r")
(count= 32)
(host_count=4)
(maxWallTime = 20)
(directory = /work/nfs201/sjzasada/hemelb-runs/)
(environment=
(GLOBUS_DUROC_SUBJOB_INDEX 1)
(GBLL_NETWORK_MPI sn_single,not_shared,US,HIGH)
(LD_LIBRARY_PATH
/usr/local/globus-4.0.4/lib/:/usr/local/packages/mpig-64/lib)
(PATH /usr/local/globus-4.0.4/bin/:/usr/local/packages/mpig-64/bin:.) )
(executable = /work/nfs201/mazzeo/hemelb/Code/hemelb.exe)
(arguments = /work/nfs201/sjzasada/hemelb-runs/angio1_input.asc)
(stderr=/work/nfs201/sjzasada/hemelb-runs/angio1_32_DUCKY_32_ZEKE_MPIg.err)
(stdout=/work/nfs201/sjzasada/hemelb-runs/angio1_32_DUCKY_32_ZEKE_MPIg.out)
)

For more details on cross-site runs on the LONI resources, see https://docs.loni.org/wiki/Running_a_Job_Using_Multiple_Clusters_with_Globus_and_mpiG
