----------------
1. s4 only :
----------------
/home (s4) : only seen on s4
/scratch[1-6] (s4) :
[dxu@s4-gateway /]$ lr |grep scra
drwxr-xr-x 2 root root 4096 May 16 2011 scratch
drwxr-xr-x 2 root root 4096 Aug 12 2011 scratch1
drwxrwxr-x 2 root root 4096 May 7 2012 scratch5
---------------------
2. badger only :
---------------------
/home (badger) : # src
/scratch[1-6] (badger) : # output data during execution
[dxu@s4-badger /]$ lr |grep scra
drwxrwxrwt 13 root domain users 4096 Jan 24 17:02 scratch5
drwxrwxrwt 8 root root 4096 Feb 2 20:43 scratch4
drwxrwxrwt 10 root root 4096 Feb 11 17:30 scratch2
drwxrwxrwt 6 root root 4096 Feb 11 17:34 scratch1
--------------------------------------------------------------------
3. Entire cluster: accessible from both s4 and badger
--------------------------------------------------------------------
/data : # store output data
/worktmp : # temporary local working
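A quick way to see how much of the /data quota you are using is a du check like the one below (a sketch; /data/$USER follows the layout above, and DATA_DIR is a hypothetical override so the check can run anywhere).

```shell
# Report usage under your /data directory (quota is 3 TB per user).
# DATA_DIR is a hypothetical override for running this off-cluster.
DATA_DIR="${DATA_DIR:-/data/$USER}"
usage=$(du -sh "$DATA_DIR" 2>/dev/null | cut -f1)
echo "Usage in $DATA_DIR: ${usage:-unknown}"
```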
---------------------------------
4. Local file to / from s4:
---------------------------------
4.1 Local ==> s4
$ rsync -av ~/abc dxu@s4.ssec.wisc.edu:~/
$ scp -r ~/abc dxu@s4.ssec.wisc.edu:~/
4.2 Local <== s4
$ rsync -av dxu@s4.ssec.wisc.edu:/data/dxu/abc ~/
$ scp -r dxu@s4.ssec.wisc.edu:/data/dxu/abc ~/
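For large or interrupted transfers, rsync can resume where it left off (a sketch; --partial and --progress are standard rsync options, and the host and paths follow the examples above; the `echo` prefix just prints the command instead of running it).

```shell
# Resumable transfer of a big directory to /data on s4.
SRC="$HOME/abc"
DEST="dxu@s4.ssec.wisc.edu:/data/dxu/"
# `echo` prints the command without transferring; drop it to run for real.
echo rsync -av --partial --progress "$SRC" "$DEST"
```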
---------------------------------
5. Submit jobs on badger:
---------------------------------
5.1 Resources : 8TB RAM, 64 nodes, 3,072 cores
1 node = 48 Cores ( Core = Slot = CPU )
64 nodes = 3,072 cores
RAM : 8,000 GB / 64 = 125 GB / Node
RAM : 125 GB / 48 = 2.6 GB / Core
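The arithmetic above can be checked directly in the shell (a sketch; the node, core, and RAM figures are the ones quoted in this section).

```shell
# Verify the badger capacity figures: 64 nodes x 48 cores, 8,000 GB RAM.
nodes=64
cores_per_node=48
total_ram_gb=8000
total_cores=$((nodes * cores_per_node))     # 3,072 cores
ram_per_node=$((total_ram_gb / nodes))      # 125 GB per node
ram_per_core=$(awk -v r="$ram_per_node" -v c="$cores_per_node" \
    'BEGIN { printf "%.1f", r / c }')       # ~2.6 GB per core
echo "$total_cores cores, $ram_per_node GB/node, $ram_per_core GB/core"
```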
5.2 Resource limits:
max_run_time = 6 hrs
default_mem_usage = 2.52 GB
3 TB quota data / user
128GB RAM / Node
3-min waiting time in queue
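Given the 6-hour runtime cap and the ~2.5 GB default memory, it can help to request resources explicitly at submission (a sketch; h_rt and vf are standard SGE resource names, but confirm the names configured on badger with `qconf -sc`; the `echo` prefix just prints the command).

```shell
# Request a 6-hour runtime limit and 3 GB of memory per slot.
# `echo` prints the command instead of submitting; drop it on badger.
echo qsub -N MyJobName -l h_rt=06:00:00 -l vf=3G script.bash
```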
5.3 Serial job (small job)
$ qsub -N MyJobName script.bash
$ qsub -N MyJobName -l vf=3G script.bash # ask for 3GB memory
Flags in scripts (lines beginning with #$ are SGE directives):
#!/bin/bash
#
#$ -q serial.q # send job to the serial.q
#$ -N myjobname # job name
#$ -e $HOME/output # error output to the $HOME/output directory
#$ -o $HOME/output # standard output to the $HOME/output directory
5.4 Parallel jobs
1. By default, "exclusivity" is on, and each node can only run one job.
2. "-l excl = false " to disable "exclusivity" so multiple jobs can share the same node.
$ qsub -pe mpi2_mpd 12 -l excl=false myjob.sh
5.4.1 MPI job ( big job )
a) submit a job
# pe (parallel environment) : -pe mpi2_mpd 192
# job name: -N MyJob
$ qsub -pe mpi2_mpd 192 -N MyJob example.bash
b) sample script : example.bash
# 1. set up ENV vars.
source /etc/bashrc
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
module load bundle/basic-1
# 2. set up working directory.
module load jobvars
WORKDIR=$TMPDIR_SHORT
# 3. run my fortran code.
mkdir -p $WORKDIR
mpiexec -machinefile $TMPDIR/machines -n $NSLOTS my_fortran_code.exe # $NSLOTS = slots granted by -pe (192 here)
# 4. clean up working directory
rm -rf $WORKDIR
5.4.2 SMP job ( medium job )
a) Submit a SMP job
$ qsub -pe smp 48 -N MyJob script.bash
b) sample script:
#!/bin/bash
# set output directory
#$ -o $HOME/output
#$ -e $HOME/output
# specify that the job runs under the bash shell (optional when bash is the default)
#$ -S /bin/bash
# source /etc/bashrc for the basic environment, including module support
source /etc/bashrc
#load jobvars module
# -this provides $TMPDIR_SHORT, $TMPDIR_LONG, and $SCRATCH
module load jobvars
# TMPDIR_SHORT is the recommended working directory, cleaned every 1 to 2 days
WORKDIR=$TMPDIR_SHORT
# make the working directory
mkdir -p $WORKDIR
echo "Working directory: $WORKDIR"
# Set up input, output and executable variables
# these often differ per job
INPUT=/data/$USER/myprogram/input
RESULTS=/data/$USER/$JOB_NAME.$JOB_ID
EXECUTABLE=$HOME/programs/myprogram.exe
#set up for MPI
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
#load modules
module load bundle/basic-1
# Do our work in the high speed scratch space
cd $WORKDIR
# copy your input to your $WORKDIR
rsync -a $INPUT/* $WORKDIR
# run the executable defined above under MPI:
mpiexec -machinefile $TMPDIR/machines -n $NSLOTS $EXECUTABLE
# copy your results to a directory in /data/$USER
rsync -a <output_files> $RESULTS
#clean up your working directory
# -if you need to debug, comment this out, and clean up manually after examining
cd #change out of $WORKDIR
rm -rf $WORKDIR
---------------------------------
5.5 Monitor jobs
---------------------------------
# list all the job status (running or pending)
$ qstat | grep dxu
$ qstat -f | grep dxu
$ qstat -u dxu
$ qstat -j your_job_id
$ qstat -xml
$ qacct -j your_job_id # info about a completed job
$ qresub your_job_id # resubmit a job
$ qdel your_job_id # delete a job
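The qstat commands above can be combined into a simple polling loop (a sketch; it assumes SGE's qstat is on PATH, as on badger, and skips the loop entirely on machines without it).

```shell
# Poll your jobs every 30 seconds until none are left in the queue.
# USER_NAME is a hypothetical override; it defaults to your login name.
USER_NAME="${USER_NAME:-${USER:-dxu}}"
while command -v qstat >/dev/null 2>&1 \
      && qstat -u "$USER_NAME" | grep -q "$USER_NAME"; do
    qstat -u "$USER_NAME"
    sleep 30
done
echo "no running or pending jobs for $USER_NAME"
```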