gfdl's home page > people > John Dunne >
Troubleshooting Guide
This document describes problems
I have encountered while executing MOM4 runscripts. Two general means
of troubleshooting include:
-
Run Interactively - If you are running in batch-mode, the easiest way
to start troubleshooting is by switching over to running
interactively, which may require reducing the number of processors
requested.
-
Totalview - A more thorough technique of debugging is to run the model within the Totalview
Debugger, which will allow you to explore the model code as it runs.
Contents:
Batch Job in Error State
Failure during initialization
Failure during run
Failure reading namelist
Failure reading tracer tree
Processor configuration incorrect
Restart file does not exist
Runtime exceeded
Batch Job in Error State
This is the scenario in which you have submitted a batch job, and
it just sits in the queue without apparently executing the job at all.
Checking the queue for pending, running and error states
-
A graphical queue manager exists to help you keep track
of your batch (non-interactive) jobs:
qmon
Clicking on the upper, far left icon ("Job Control"; a ship's wheel) will allow you to see "Pending Jobs", "Running Jobs" and "Finished Jobs". You can also use qmon to cancel jobs and to check on the overall status of the computers.
-
A more direct method of viewing jobs is with:
qstat -u [USERNAME]
which lists your current jobs.
-
In order to receive an automatic email message whenever a batch
job crashes, create a file in your home directory called
.sge_request with the sole contents of:
-
-M [USERNAME]
-
When a runscript is submitted in batch mode, the number of
processors requested of the queue manager in the header
(line three in my scripts):
#$ -pe lsc.alloc 12
must equal the number of processors assigned in the runscript:
set npes=12
-
When a runscript is submitted in batch mode, a "stdout"
filename pathway must be specified. This is done in the header
(line four in my scripts) as:
-
#$ -o /archive/[USERNAME]/jakarta/om1_ocmip2_biotic/1x0m1d_12pe/ascii/stdout
-
The directory that will hold this file should already exist before the job is submitted:
/archive/[USERNAME]/jakarta/om1_ocmip2_biotic/1x0m1d_12pe/ascii/
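The requirement above that the processor count in the queue header equal npes can be checked mechanically before a job is submitted. The following is a minimal sketch in POSIX sh (the runscripts themselves are csh). The helper name check_npes is hypothetical, and the sketch assumes the header line looks like "#$ -pe lsc.alloc 12" and the assignment looks like "set npes=12", as in the examples in this guide.

```shell
#!/bin/sh
# check_npes RUNSCRIPT: compare the processor count requested from the queue
# manager with the npes value set in the same runscript.
# Hypothetical helper; not part of MOM4 or of the queue software.
check_npes() {
    script=$1
    # Last field of the first "#$ -pe ..." header line.
    requested=$(grep '^#\$ *-pe ' "$script" | head -n 1 | awk '{ print $NF }')
    # Digits following "=" on the first "set npes=..." line.
    assigned=$(grep '^ *set  *npes *=' "$script" | head -n 1 |
        sed 's/.*= *//; s/[^0-9].*//')
    if [ -z "$requested" ] || [ -z "$assigned" ]; then
        echo "could not find both settings in $script"
        return 2
    fi
    if [ "$requested" = "$assigned" ]; then
        echo "consistent: $requested processors"
    else
        echo "mismatch: header requests $requested, runscript sets npes=$assigned"
        return 1
    fi
}
```

After sourcing the file, it would be called as: check_npes [RUNSCRIPT]. A non-zero return status indicates a mismatch or a setting that could not be found.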
Failure during initialization
-
Initialization is the stage in which the runscript
creates directories and copies over the necessary files. Make
sure that all of the files it expects actually exist.
Failure during run
-
A general approach to tracking down the source of errors during
runtime is to search for them in the code itself. The Totalview Debugger will do
this in a formal sense, but a quicker way is to change to the
directory in which the makefile for the given code resides, and run:
make localize
This will copy all of the code used to create the executable to
the present working directory. Once all the code is there, a list
of the files and lines of code containing the
error message can be found with:
grep -n "[error message]" *90
This is generally a good place to start.
Failure reading namelist
Usually, this error is due to a syntax error in the namelist. Use the above grep procedure to find where the namelist variables are defined, and make sure that the variables in the runscript you are using are consistent with this list.
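The grep procedure can be wrapped in two small helpers, sketched here in POSIX sh. Both names (find_message, find_namelist) are hypothetical. The sketch assumes that "make localize" has already copied the source into the current directory, that source file names end in "90", and that each namelist group is declared in the usual Fortran form "namelist /group_name/ ...".

```shell
#!/bin/sh
# find_message "TEXT": list file:line:content for every literal occurrence
# of TEXT in the localized source files (names ending in "90").
find_message() {
    # -n prints line numbers; -F treats TEXT as a fixed string, so
    # parentheses or dots in an error message are not read as a regex.
    # /dev/null forces grep to print file names even if only one file matches.
    grep -n -F -- "$1" ./*90 /dev/null
}

# find_namelist NAME: show where the namelist group NAME is declared.
# Only the first line of a declaration is shown, not continuation lines.
find_namelist() {
    grep -n -i -- "namelist */ *$1 */" ./*90 /dev/null
}
```

For example, find_namelist ocean_model_nml points to the declaration whose variable list can then be compared against the namelist written by the runscript.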
Failure reading tracer tree
-
The runscript concatenates tracer input files into five
namelist files corresponding to:
-
input.nml - options entering the tracer tree namelist
-
ocean_prog_tracer_tree_init - tracer packages to be turned on
and options to apply as defaults for all prognostic (advected
and diffused) tracers in that package
-
ocean_prog_tracer_tree - options to apply to individual
prognostic (advected and diffused) tracers
-
ocean_diag_tracer_tree_init - tracer packages to be turned on
and options to apply as defaults for all diagnostic tracers in that package
-
ocean_diag_tracer_tree - options to apply to individual
diagnostic tracers
The contents of any of these files can be viewed with:
more [FILENAME]
Processor configuration incorrect
-
The processor configuration is specified in a number of areas for
flexibility. The primary place that this is specified is near the
top of the runscript:
set npes=12
An inconsistency with this setting can cause an error under the following circumstances:
-
As described above, when running as a batch job, the number of
processors must be consistent with the number of
processors requested of the queue manager in the header
(line three in my scripts):
#$ -pe lsc.alloc 12
-
When running with static allocation, the executable must have
been compiled with the same layout
that the runscript attempts to use. While the runscript may
fortuitously default to match, it is best to make this explicit
by ensuring that the layout is consistent with the dimensions
specified at the top of the Makefile with:
CPPDEFS = -Duse_netCDF -Duse_libMPI -DSTATIC_MEMORY -DNI_=96 -DNJ_=40 -DNK_=24 -DNI_LOCAL_=16 -DNJ_LOCAL_=20 -DNUM_PROG_TRACERS_=7 -DNUM_DIAG_TRACERS_=1
where NI_ / NI_LOCAL_ = number of domains (processors) in the x direction (96 / 16 = 6) and NJ_ / NJ_LOCAL_ = number of domains (processors) in the y direction (40 / 20 = 2).
-
If the namelist "ocean_model_nml" contains the option:
layout=6,2
then these two numbers must multiply together to equal the number of processors (6 x 2 = 12).
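The arithmetic linking the static dimensions, the layout and npes can be reproduced with a short script. This is a sketch in POSIX sh under stated assumptions: get_def and check_layout are hypothetical helper names, and the CPPDEFS definition is assumed to sit on a single line of the Makefile with integer values, as in the example above.

```shell
#!/bin/sh
# get_def NAME FILE: print the integer given by -DNAME=<value> on the
# CPPDEFS line of FILE (hypothetical helper).
get_def() {
    grep '^ *CPPDEFS' "$2" | tr ' \t' '\n\n' |
        sed -n "s/^-D$1=\([0-9][0-9]*\)\$/\1/p" | head -n 1
}

# check_layout MAKEFILE NPES: derive the layout implied by the static
# dimensions and succeed only if it uses exactly NPES processors.
check_layout() {
    makefile=$1
    npes=$2
    ni=$(get_def NI_ "$makefile")
    nj=$(get_def NJ_ "$makefile")
    ni_local=$(get_def NI_LOCAL_ "$makefile")
    nj_local=$(get_def NJ_LOCAL_ "$makefile")
    if [ -z "$ni" ] || [ -z "$nj" ] || [ -z "$ni_local" ] || [ -z "$nj_local" ]; then
        echo "static dimensions not found in $makefile"
        return 2
    fi
    if [ "$ni_local" -le 0 ] || [ "$nj_local" -le 0 ]; then
        echo "local dimensions must be positive"
        return 2
    fi
    if [ $((ni % ni_local)) -ne 0 ] || [ $((nj % nj_local)) -ne 0 ]; then
        echo "global dimensions are not multiples of the local dimensions"
        return 2
    fi
    nx=$((ni / ni_local))   # domains in the x direction
    ny=$((nj / nj_local))   # domains in the y direction
    total=$((nx * ny))
    echo "layout=$nx,$ny uses $total processors"
    [ "$total" -eq "$npes" ]
}
```

With the CPPDEFS line shown above, check_layout Makefile 12 prints "layout=6,2 uses 12 processors" and returns success; any other npes value gives a non-zero return status.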
Restart file does not exist
-
The first time a runscript is called, it creates an empty file
called initialized which allows the runscript to determine
whether to use the "initial condition" or "restart" tracer_tree
files. If the runscript has been attempted before, but did not get to the
end of the run, then the run will crash, as the file initialized will
have been created without the restarts. The solution to this
problem is to always delete the initialized file when
resubmitting a runscript for its initial run.
-
When running interactively, it's also a good idea to always
delete your temporary directory to make sure
that the program uses the intended files. Otherwise, some
files, such as initialized and restarts, may linger,
causing problems.
-
Check tracer tree files to ensure that the file names and variable
names are correct.
Runtime exceeded
-
Note that there are time limits imposed on both interactive and
batch jobs:
| computer cluster | ac interactive | lsc interactive | lsc batch |
| time limit | 8 hours | 30 minutes | 8 hours |
| memoryuse | 4 GBytes | 1 GByte | 512 MBytes |
| processor limit | 16 (soft) | 124 | 500 |
-
If the model is running slower than expected, it is usually one of
four things:
-
If memoryuse (see table above) exceeds the available RAM
for a given processor. When this happens, memory will be
borrowed from other processors, which can slow
the model to a standstill. The diagnostic memoryuse is
printed out in the file [RUNINFO]fms.out
within the ascii directory.
-
If diagnostics are being calculated and printed out very
frequently on a longer run.
This can be checked by searching for instances of "diag_freq"
and "output_interval" in the
namelists. There are 5 instances of "diag_freq" in my current
runscripts. If "diag_freq" is set to a small
number (e.g. 1), reset it to a big number (e.g. 1000). You can
check how much time the model is spending on diagnostics (and
other components) by scanning the extensive diagnostic table printed
out at the end of the run and held in the file [RUNINFO]fms.out
within the ascii directory, which lists timings, in seconds, for
"Total runtime" and "Ocean numerical diagnostics" (among many
other components). Diagnostics can slow the model down by a factor of 10.
-
If the model has been compiled with the debug option - for
example, to be run in the
Totalview Debugger - then it will slow down the model by
approximately a factor of 5.
-
If the model has been compiled with dynamic allocation, then it
will slow down the model by approximately a factor of 2
relative to static allocation.
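The search for small "diag_freq" settings described above can be automated. The following POSIX sh sketch is one way to do it: list_frequent_diags is a hypothetical helper name, the default threshold of 10 is an arbitrary choice, and the sketch assumes each setting appears on its own line in the form "diag_freq = 1".

```shell
#!/bin/sh
# list_frequent_diags FILE [THRESHOLD]: print file:line for every diag_freq
# setting in a namelist file whose value is below THRESHOLD (default 10).
# Hypothetical helper; the threshold is an arbitrary illustration.
list_frequent_diags() {
    nml=$1
    threshold=${2:-10}
    awk -v t="$threshold" '
        /diag_freq[ \t]*=/ {
            value = $0
            sub(/.*diag_freq[ \t]*=[ \t]*/, "", value)  # drop text up to "="
            sub(/[^0-9].*/, "", value)                  # keep leading digits
            if (value != "" && value + 0 < t)
                print FILENAME ":" FNR ": diag_freq=" value
        }' "$nml"
}

# The timing entries named in the text can be pulled out of the run log with:
#   grep -e "Total runtime" -e "Ocean numerical diagnostics" [RUNINFO]fms.out
```

Each line reported is a candidate for resetting to a larger value before a long run.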
