NOAA

Geophysical Fluid
Dynamics Laboratory


gfdl's home page > people > John Dunne >

Troubleshooting Guide

Description

This document describes problems I have encountered while executing MOM4 runscripts. Two general means of troubleshooting are:
    Run Interactively - If you are running in batch mode, the easiest way to start troubleshooting is to switch to running interactively, which may require reducing the number of processors requested.
    Totalview - A more thorough debugging technique is to run the model within the Totalview Debugger, which allows you to explore the model code as it runs.
Below are some common problems and methods for solving them, listed in the order in which they occur during execution.

Batch Job in Error State
Failure during initialization
Failure during run



Batch Job in Error State

This is the scenario in which you have submitted a batch job and it just sits in the queue, apparently without executing at all.

Checking the queue for pending, running and error states
  • An object-oriented queue manager exists to help you keep track of your batch (non-interactive) jobs:

      qmon

    Clicking on the upper, far left icon ("Job Control"; a ship's wheel) will allow you to see "Pending Jobs", "Running Jobs" and "Finished Jobs". You can also use qmon to cancel jobs and to check on the overall status of the computers.

  • A more direct method of viewing jobs is with:

      qstat -u [USERNAME]

    which lists your current jobs.

  • To receive an automatic email message whenever a batch job crashes, create a file in your home directory called .sge_request with the sole contents of:

      -M [USERNAME]

    When the program crashes, an email message will be sent to you which includes the start and end times (so you can easily check whether the run hit the maximum time), as well as some possibly helpful information.
Common reasons for an error state
  • When a runscript is submitted in batch mode, the number of processors requested of the queue manager in the header (line three in my scripts):

      #$ -pe lsc.alloc 12

    must equal the number of processors assigned in the runscript:

      set npes=12

  • When a runscript is submitted in batch mode, a "stdout" filename pathway must be specified. This is done in the header (line four in my scripts) as:

      #$ -o /archive/[USERNAME]/jakarta/om1_ocmip2_biotic/1x0m1d_12pe/ascii/stdout

    If this pathway does not exist, the runscript will crash without generating any output, so make sure that the directory:

      /archive/[USERNAME]/jakarta/om1_ocmip2_biotic/1x0m1d_12pe/ascii/

    exists.
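
The two header checks above can be automated before submitting a job. The following is only a minimal sketch, not part of the MOM4 distribution: the function name check_runscript_header is hypothetical, and it assumes the runscript uses exactly the "#$ -pe", "#$ -o" and "set npes=" forms shown above.

```python
import os
import re


def check_runscript_header(script_text):
    """Return a list of problems found in a runscript's batch header.

    Hypothetical helper: it only understands the header forms shown in
    this guide ("#$ -pe <name> <n>", "#$ -o <path>", "set npes=<n>").
    """
    problems = []

    # Processor count requested of the queue manager, e.g. "#$ -pe lsc.alloc 12".
    pe = re.search(r"^#\$\s+-pe\s+\S+\s+(\d+)", script_text, re.MULTILINE)
    # Processor count assigned in the runscript itself, e.g. "set npes=12".
    npes = re.search(r"^\s*set\s+npes\s*=\s*(\d+)", script_text, re.MULTILINE)
    if pe and npes and int(pe.group(1)) != int(npes.group(1)):
        problems.append(
            "-pe requests %s processors but npes=%s" % (pe.group(1), npes.group(1))
        )

    # stdout pathway, e.g. "#$ -o /archive/[USERNAME]/.../ascii/stdout".
    out = re.search(r"^#\$\s+-o\s+(\S+)", script_text, re.MULTILINE)
    if out:
        out_dir = os.path.dirname(out.group(1))
        if not os.path.isdir(out_dir):
            problems.append("stdout directory does not exist: %s" % out_dir)

    return problems
```

Calling check_runscript_header(open("my_runscript").read()), where my_runscript is a placeholder name, returns an empty list when both checks pass.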

Failure during initialization
    Initialization is the part where the runscript is creating folders and copying over the necessary files. Make sure that all of these files exist.

Failure during run
    A general approach to tracking down the source of errors during runtime is to search for them in the code itself. The Totalview Debugger will do this in a formal sense, but a quicker way is to change to the directory in which the makefile for the given code resides, and run:

      make localize

    This will copy all of the code used to create the executable to the present working directory. Once all the code is there, a list of the files and lines of code containing the error message can be found with:

      grep [error message] *90

    This is generally a good place to start.
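
If line numbers are wanted alongside the matching files, the same search can be sketched in Python. This is only an illustration under assumptions: the function name find_message is hypothetical, it does a plain substring match rather than a grep-style regular expression, and it reuses the *90 file pattern from the grep command above.

```python
import glob


def find_message(message, pattern="*90"):
    """Return (filename, line_number, line) for each source line containing message.

    Hypothetical helper: plain substring search over the files matching
    the given glob pattern (by default, names ending in "90").
    """
    hits = []
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the search going if a file has odd bytes.
        with open(path, errors="replace") as source:
            for number, line in enumerate(source, start=1):
                if message in line:
                    hits.append((path, number, line.rstrip()))
    return hits
```

Run it from the directory populated by make localize, passing the text of the error message as the first argument.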

Failure reading namelist

Usually, this error is due to a syntax error in the namelist. Use the above grep procedure to find where the namelist variables are defined, and make sure that the variables in the runscript you are using are consistent with this list.

Failure reading tracer tree
    The runscript concatenates tracer input files into five namelist files corresponding to:

    • input.nml - options entering the tracer tree namelist
    • ocean_prog_tracer_tree_init - tracer packages to be turned on and options to apply as defaults for all prognostic (advected and diffused) tracers in that package
    • ocean_prog_tracer_tree - options to apply to individual prognostic (advected and diffused) tracers
    • ocean_diag_tracer_tree_init - tracer packages to be turned on and options to apply as defaults for all diagnostic tracers in that package
    • ocean_diag_tracer_tree - options to apply to individual diagnostic tracers

    and copies them to the working directory for the particular model run. To check that this is working as expected, cd to that directory and check these files with:

      more [FILENAME]

Processor configuration incorrect

    The processor configuration is specified in a number of areas for flexibility. The primary place that this is specified is near the top of the runscript:

    set npes=12

    An inconsistency with this setting can cause an error under the following circumstances:

    • If the namelist "ocean_model_nml" contains the option:

      layout=6,2

      then these two numbers must multiply together to equal the number of processors (6 x 2 = 12).

    • As described above, when running as a batch job, the number of processors must be consistent with the number of processors requested of the queue manager in the header (line three in my scripts):

      #$ -pe lsc.alloc 12

    • When running with static allocation, the executable must have been compiled with the same layout that the runscript attempts to use. While the runscript may fortuitously default to match, it is best to make this explicit by ensuring that the layout is consistent with the dimensions specified at the top of the Makefile with:

      CPPDEFS = -Duse_netCDF -Duse_libMPI -DSTATIC_MEMORY -DNI_=96 -DNJ_=40 -DNK_=24 -DNI_LOCAL_=16 -DNJ_LOCAL_=20 -DNUM_PROG_TRACERS_=7 -DNUM_DIAG_TRACERS_=1

      where NI_ / NI_LOCAL_ = number of domains (processors) in the x direction (96 / 16 = 6) and NJ_ / NJ_LOCAL_ = number of domains (processors) in the y direction (40 / 20 = 2).
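
The arithmetic behind these consistency checks can be sketched as follows. This is a rough illustration rather than anything shipped with the model: the function names domains_from_cppdefs and check_layout are hypothetical, and the sketch assumes that the global grid size is an exact multiple of the local size in each direction.

```python
import re


def domains_from_cppdefs(cppdefs):
    """Return (x_domains, y_domains) implied by the static-memory -D flags."""
    # Collect only the flags of the form -DNAME=<integer>.
    values = {name: int(num) for name, num in re.findall(r"-D(\w+)=(\d+)", cppdefs)}
    ni, ni_local = values["NI_"], values["NI_LOCAL_"]
    nj, nj_local = values["NJ_"], values["NJ_LOCAL_"]
    # Assumption of this sketch: each direction divides evenly.
    if ni % ni_local or nj % nj_local:
        raise ValueError("global grid size is not a multiple of the local size")
    return ni // ni_local, nj // nj_local


def check_layout(npes, layout, cppdefs):
    """Return a list of inconsistencies between npes, layout and CPPDEFS."""
    problems = []
    if layout[0] * layout[1] != npes:
        problems.append(
            "layout %d x %d does not equal npes=%d" % (layout[0], layout[1], npes)
        )
    compiled = domains_from_cppdefs(cppdefs)
    if compiled != tuple(layout):
        problems.append(
            "executable compiled for %d x %d domains, layout is %d x %d"
            % (compiled[0], compiled[1], layout[0], layout[1])
        )
    return problems
```

With the CPPDEFS line above, npes=12 and layout=6,2, check_layout returns an empty list, since 96 / 16 = 6, 40 / 20 = 2 and 6 x 2 = 12.
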
Restart file does not exist

  • The first time a runscript is called, it creates an empty file called initialized, which allows the runscript to determine whether to use the "initial condition" or "restart" tracer_tree files. If the runscript has been attempted before but did not get to the end of the run, then the next attempt will crash, because the file initialized will have been created without the restarts. The solution to this problem is to always delete the initialized file when resubmitting a runscript for its initial run.

  • When running interactively, it is also a good idea to always delete your temporary directory to make sure that the program uses the intended files. Otherwise, some files, such as initialized and the restarts, may linger and cause problems.

  • Check tracer tree files to assure that the file names and variable names are correct.

Runtime exceeded

    Note that there are time limits imposed on both interactive and batch jobs:

    computer cluster    ac interactive    lsc interactive    lsc batch
    time limit          8 hours           30 minutes         8 hours
    memoryuse           4 GBytes          1 GByte            512 MBytes
    processor limit     16 (soft)         124                500

    If the model is running slower than expected, it is usually one of four things:

    • If memoryuse (see table above) exceeds the available RAM for a given processor - When this happens, memory will be borrowed from other processors, which can slow the model to a stand-still. The diagnostic memoryuse is printed out in the file [RUNINFO]fms.out within the ascii directory.

    • If diagnostics are being calculated and printed out very frequently on a longer run. This can be checked by searching for instances of "diag_freq" and "output_interval" in the namelists. There are 5 instances of "diag_freq" in my current runscripts. If "diag_freq" is set to a small number (e.g. 1), reset it to a big number (e.g. 1000). You can check how much time the model is spending on diagnostics (and other components) by scanning the extensive diagnostic table that is printed at the end of the run and held in the file [RUNINFO]fms.out within the ascii directory. Look for the timings, in seconds, for "Total runtime" and "Ocean numerical diagnostics" (among many other components). Diagnostics can slow the model down by a factor of 10.

    • If the model has been compiled with the debug option - for example, to be run in the Totalview Debugger - then it will slow down the model by approximately a factor of 5.

    • If the model has been compiled with dynamic allocation, then it will slow down the model by approximately a factor of 2 relative to static allocation.




last modified: February 11, 2004.