Tue, 11 Dec 2007
FRE: hsmfiles
<div class="www">
This item is GFDL-internal only: if you are
authorized and set up to have access to cobweb, change www to cobweb
in the URL bar above.
</div>
<div class="cobweb">
FRE was originally conceived with a 2-layer storage model: there was
\$TMPDIR and there was \$ARCHIVE . The former was local disk attached to
the compute node, and lasted only for the duration of the job; the
latter was connected to tape and "deep" (linear) storage.
Depending on the filesystem and archive configuration and performance,
there may be different ways to achieve this. Over time we have
struggled with multiple optimization targets:
- minimize file transfer time
- minimize file count (i.e. inode count)
- minimize movement of "intermediate" files
- minimize redundant copies of the same data
- optimize file size for tape storage
- minimize total archive size (eliminate redundant copies)
At various times, we have privileged one or another of these
optimization targets. We have tried to achieve an "optimal" file size
(100 GB, as advised in 2004... is that still useful?) by using cpio to
make file archives. We once did mppnccombine "online" (i.e. within the
runscript) to minimize the amount of intermediate data that was
archived; later we moved it into a separate job because mppnccombine
is not parallelizable to O(100) CPUs and was taking too much time;
later still we moved it back in because file transfers were taking too
long.
For a while now we have been proposing a 3-layer storage model: the
third layer, called ptmp , is a fast scratch area for storing
intermediate files. We propose to use this to minimize archive use,
and for fast access to "intermediate" files from multiple jobs. ptmp
could be a shared filesystem (like /ptmp or /work ), or might be on
local disk (vftmp : in this case all the jobs sharing this data have to
run on the same IC node...).
Tim proposes this high-level design for a tool called hsmfiles to
manage data transfers on a 3-level storage model:
FILESYSTEMS
/vftmp 4 TB/host
/ptmp 44 TB
/work 22 TB
DCM 260 TB (/archive nearline disk = 4 x 65 TB)
files > 500 MB do not go to DCM. revise this?
FILESYSTEM PERFORMANCE
dd bs=16M of 4 GB from /dev/zero to:
/vftmp 976 MB/s
/ptmp 625 MB/s
/arch2 423 MB/s
/arch3 115 MB/s
dd bs=16M of 4 GB from /vftmp file to:
/vftmp 192 MB/s
/ptmp 166 MB/s
/arch2 147 MB/s
/arch3 110 MB/s
run on ic4 in cpuset of 4 contiguous cpus
/arch2 is nearly idle
/vftmp needs pre-allocation, bigger allocation units?
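The measurement commands were presumably along these lines (a
reconstruction from the numbers above; paths are illustrative):
dd if=/dev/zero of=/vftmp/\$USER/zero.dat bs=16M count=256   # 4 GB write test
dd if=/vftmp/\$USER/src.dat of=/ptmp/\$USER/dst.dat bs=16M    # 4 GB copy test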
HSMFILES DESIGN
storage levels
/archive, "disk", and "workdir"
"disk" is /ptmp now, could change to /archive nearline disk (DCM)
production job puts to /ptmp, post-processing job puts to /archive
/ptmp and /archive use same directory structure
hsmfiles does mkdir -p in /ptmp and /archive as needed
container types
directory, heirloom cpio -K file, gnu tar file
default = directory
non-blocking parallel copies
default for -put to directory container
in future, could extend to cpio/tar containers
hsmfiles -get -link allows reading from /ptmp
hsmfiles -get -workdir=path allows /ptmp scratch dir
in future, could leave files in /vftmp/hsmfiles
hand off to file transfer daemon
reload job makes soft request for same exec host
if host crashes, files are stranded
HSMFILES OPTIONS
"do what i mean semantics" throughout
get destination/put source is:
-workdir if specified
\$cwd, if a subdirectory of /vftmp
otherwise, \$TMPDIR
USAGE:
hsmfiles operation [options] [container_type] [container_path] [filelist]
operation
-get get from /ptmp if there, otherwise from /archive
-get=archive get from /archive only
-get=stage copy /archive files to /ptmp
-put put to /ptmp
-put=archive put to /archive
-put=migrate put /ptmp files to /archive
-wait with -put, do blocking copies
alone, wait for job's non-blocking copies to complete
-list ls -l, cpio -itv, tar -tv
options
-workdir=name workdir = subdir of \$TMPDIR
-workdir=path workdir = arbitrary pathname
-link symlink to destination
-keep don't remove source files after put
-append append to container
-strict exit if any files in filelist are missing
-dryrun just print what would be done
container_type
-dir directory (default)
-cpio heirloom cpio -K file
-tar gnu tar file
container_path
.cpio/.tar extension added/removed as needed
if omitted for -dir, use \$cwd if in /archive or /ptmp
filelist
basename[=workdirname] ... (no paths, no patterns)
if omitted and -get, get all files in container
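Some hypothetical invocations, to make the semantics concrete (paths
and filenames are illustrative; none of this is implemented yet):
hsmfiles -put /ptmp/\$USER/myexpt/history 19820101.nc 19830101.nc
    # non-blocking parallel put of two files into a directory container
hsmfiles -get=stage -cpio /archive/\$USER/myexpt/restart/19830101
    # stage 19830101.cpio from /archive to /ptmp
hsmfiles -wait
    # wait for this job's outstanding non-blocking copies to complete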
INTERNAL UTILITIES
hsmget, hsmput
run multiple hsmcp's in parallel
in cpuset = private /sge/jobid.1/hsmfiles? shared /sge/hsmfiles?
non-blocking (default) or blocking
process name or arguments includes jobid
hsmfiles -wait uses ps to check non-blocking processes
hsmcp
16 MB I/O's
cxfscp or custom?
hsmcpio, hsmtar
custom heirloom cpio and gnu tar
seek, instead of reading thru data, for -A, -t, -x
would this allow parallel invocations?
hsmtype
return container type
hsmmigrate
if /ptmp fills, migrates /ptmp to /archive
may use xfs extended attributes
RELATED USER COMMANDS
/usr/local/OSoverlay/bin/cpio
enforces blocksize = n*64KB
adds -K
/usr/local/OSoverlay/bin/tar (planned)
gnu tar with mods, for use outside hsmfiles
default blocksize = 64KB for write, 512KB for read
enforce blocksize = n*64KB
Mon, 3 Dec 2007
FRE: HPCS data transfer speeds
<div class="www">
This item is GFDL-internal only: if you are
authorized and set up to have access to cobweb, change www to cobweb
in the URL bar above.
</div>
<div class="cobweb">
FRE was originally conceived with a 2-layer storage model: there was
\$TMPDIR and there was \$ARCHIVE . The former was local disk attached to
the compute node, and lasted only for the duration of the job; the
latter was connected to tape and "deep" (linear) storage.
Depending on the filesystem and archive configuration and performance,
there may be different ways to achieve this. Over time we have
struggled with multiple optimization targets:
- minimize file transfer time
- minimize file count (i.e. inode count)
- minimize movement of "intermediate" files
- minimize redundant copies of the same data
- optimize file size for tape storage
- minimize total archive size (eliminate redundant copies)
At various times, we have privileged one or another of these
optimization targets. We have tried to achieve an "optimal" file size
(100 GB, as advised in 2004... is that still useful?) by using cpio to
make file archives. We once did mppnccombine "online" (i.e. within the
runscript) to minimize the amount of intermediate data that was
archived; later we moved it into a separate job because mppnccombine
is not parallelizable to O(100) CPUs and was taking too much time;
later still we moved it back in because file transfers were taking too
long.
For a while now we have been proposing a 3-layer storage model: the
third layer, called ptmp , is a fast scratch area for storing
intermediate files. We propose to use this to minimize archive use,
and for fast access to "intermediate" files from multiple jobs. ptmp
could be a shared filesystem (like /ptmp or /work ), or might be on
local disk (vftmp : in this case all the jobs sharing this data have to
run on the same IC node...).
There are some questions that arise:
- is it worthwhile making archive files? If it's the archive creation
and extraction itself that's taking time, perhaps it's better to
leave files as they are.
- if we are making archive files, should we continue to use cpio , or
should we use something else, e.g. tar ? (Note that we aren't using
the GNU/linux standard cpio , but a version that's been locally hacked
for performance. We are very likely moving toward multiple
computing sites: one requirement must be that any archive files
created must be portable to systems without custom tools.)
- where should intermediate files be stored? In /vftmp itself, /ptmp
or /work , or maybe /archive ? (The last option is the idea of
making a large filesystem on /archive , but marking "intermediate"
files in a way that prevents them from being written to tape.)
- what's the fastest way to transfer files to various destinations?
cp , rcp , rsync , cxfscp ? Should we make archives and stream single
large files, or stream files as they are, using the -r option that
all of the above support? (The reason I include rsync is that it
seems to be clever about interrupted or duplicated transfers, and only
transfers the needed increments.)
I've done some measurements, which may be useful here.
The script I used is /home/vb/bin/cpspeed.csh . Please feel free to use
it if you feel the urge.
The test assumes you're in the middle of a running production job, and
have a top-level directory that contains a number of files.
- There are 3 source directories, of size 3 GB (small, called
INPUT ),
25 GB (medium, called om3p25 ) and 155 GB (large, called
A19_history ).
- There are 3 destinations:
/vftmp , /work and /archive .
- There are 10 transports:
cp , rcp , rsync and cxfscp ; the same but
with the -r option; and then cpio and tar directly to the
destination. The four with the -r option have the additional
advantage that no time is needed for packing/unpacking the archive.
- the tests have been run on various nodes.
The results so far are very disappointing: no method on any host has
come within an order of magnitude of the expected bandwidth, which
should be in the single-digit GB/s range. Is there any basic
procedural flaw in the tests?
host | size | target | cp | cp -r | rcp | rcp -r | rsync | rsync -r | cxfscp | cxfscp -r | cpio | tar
ic10 | 3 GB | /vftmp | 6.50 | 7.15 | 7.57 | 7.94 | 30.91 | 44.50 | 45.20 | 33.03 | 46.87 | 41.20
ic10 | 3 GB | /work | 37.70 | 23.77 | 149.83 | 153.53 | 67.53 | 34.17 | 22.29 | 21.17 | 59.05 | 40.85
ic10 | 3 GB | /archive | 90.19 | 60.34 | 402.67 | 482.44 | 455.43 | 242.62 | 94.44 | 92.50 | 270.28 | 365.99
ic10 | 25 GB | /vftmp | 61.08 | 88.49 | 69.87 | 70.99 | 247.37 | 286.38 | 164.68 | 189.78 | 326.38 | 182.93
ic10 | 25 GB | /work | 420.17 | 138.81 | 1154.92 | 1123.40 | 756.88 | 325.64 | 118.50 | 120.73 | 205.33 | 256.84
ic10 | 25 GB | /archive | 573.23 | 722.37 | 3582.61 | 3198.38 | 1589.06 | 2143.48 | 582.37 | 694.26 | 1566.29 | 1008.74
ic10 | 166 GB | /vftmp | 3226.53 | 1557.12 | 1444.46 | 1350.96 | 2880.95 | 2738.83 | 713.86 | 565.55 | 1543.45 | 1959.81
ic10 | 166 GB | /work | 1394.47 | 1650.04 | 7537.16 | 778..55 | 3362.38 | 3048.66 | 1000.01 | 1150.76 | 1831.88 | 1953.08
ic10 | 166 GB | /archive | 2890.00 | 2564.34 | 14717.04 | 32917.24 | 20611.61 | 13848.50 | 4189.66 | 5133.96 | 13514.89 |
Wed, 14 Nov 2007
gridspec: second version of gridspec-tools released.
A second version of the gridspec tools has been released, at this
starting point. This now includes a full example of coupled model grid
creation.
Mon, 12 Nov 2007
gridspec: first version of gridspec-tools released.
A first version of the gridspec tools has been released, at this
starting point.
Tue, 6 Nov 2007
linux: dual computer, single keyboard and mouse.
Thanks to Chan Wilson, I can now use a single keyboard and mouse when
my desktop and laptop sit side-by-side. I make my desktop machine run
a VNC server that allows another machine to control it; and I let my
laptop's keyboard and mouse control it. This has to be done over the
daisy tunnel.
- Download
x11vnc for the desktop; and x2vnc for the laptop.
- Set up the
daisy tunnel to do this port-forwarding;
LocalForward 5910 vb.gfdl.noaa.gov:10001
- Create and store (encrypted) a password for this connection.
vncpasswd
x11vnc -storepasswd
- Copy this passwd (which will be in \$HOME/.vnc/passwd ) to the same
location on your laptop.
- Start the desktop VNC server listening on the
10001 port.
x11vnc -nofb -rfbport 10001 -usepw &
- Start the laptop VNC client sending on local port
5910 (the number
on the command line is the port you listed in the tunnel, minus 5900,
don't ask...)
x2vnc -passwdfile ~/.vnc/passwd -west localhost:10 &
The west means that when you slide off the west edge of your laptop,
you'll be on your desktop.
Tue, 6 Nov 2007
linux: dual display
Dual display is supported by the NVidia card, a feature they call
TwinView . Their own site hosts detailed instructions on configuring
TwinView .
It works best for me if the displays are side by side to make a single
wide display; or as a clone display used in conjunction with a
projector.
First, make sure you're using NVidia's own driver instead of the
generic one distributed with most linux distros. (Unfortunately, this
violates free software purity.) You will probably find something
called nvidia-glx or something like that (that's the name of the
Ubuntu package for it).
You can tell by looking at the Driver section of your
/etc/X11/xorg.conf file: the driver should be called nvidia and not
nv .
Clone display, laptop and projector
This runs a clone display suitable for most projectors:
Section "Screen"
Identifier "laptop-projector"
Device "NVIDIA Corporation NVIDIA Default Card"
Monitor "Generic Monitor"
DefaultDepth 24
Option "TwinView" "True"
Option "TwinViewOrientation" "Clone"
Option "UseEdidFreqs" "True"
Option "MetaModes" "DFP-0: 1920x1200, 1024x768 @1920x1200 +4+124"
...
DFP-0 identifies the primary (laptop) display, the second clones a
1024x768 window within it to the second display with a +4+124 offset
from the top left corner. The offset makes allowance for window
manager decorations, which you don't want to see on the projector.
If X is already running when you connect to the projector, you may
need to restart X to detect the secondary display.
Dual display
Side-by-side, you can have two monitors showing one giant display,
like so:
Section "Screen"
Identifier "dual-1600"
Device "NVIDIA Corporation NVIDIA Default Card"
Monitor "Generic Monitor"
DefaultDepth 24
Option "TwinView" "True"
Option "TwinViewOrientation" "RightOf"
Option "UseEdidFreqs" "True"
Option "MetaModes" "1920x1200, 1600x1200"
makes a single 3520x1200 display. Note that this is only effective if
the two monitors have modes that share a Y resolution.
Fri, 2 Nov 2007
FRE: checkpointing, add user control
We currently signal checkpoints by having Ops create a flag in the
directory /home/gfdl/flags :
if ( -f /home/gfdl/flags/fre.checkpoint.\$HOSTNAME || \
-f /home/gfdl/flags/fre.checkpoint.all || \
-f /home/gfdl/flags/jobs/fre.checkpoint.\$JOB_ID ) then
#exit gracefully...
I propose we extend this to allow users also to bring down their own
jobs, by creating the file \$HOME/fre.checkpoint.\$JOB_ID .
The frerun -generated jobscript can be made to delete this file on its
way down, so you don't get a backlog of these files sitting around in
\$HOME . (Though even if you did, it's a one-in-a-million chance, quite
precisely, against your checkpointing another job inadvertently...
that's how long it takes for JOB_ID to get recycled).
Fri, 2 Nov 2007
FMS: online checkpointing bug
Online checkpointing is now working and has been checked in on the
omsk_vb branch.
There is a bug with the debug template (checks and warnings turned
on). An attempt to read the value of the function checkpoint , which
looks like the logical variable checkpoint , results in an RTL error.
This needs to be reported to Intel.
Here's how to recreate the error:
cvs co -r omsk shared
cvs up -r 1.1.2.4 shared/platform
cd shared/platform
mkmf -t /home/fms/bin/mkmf.debugtemplate.ia64 \
-c"-DGFDL_HPCS -Dtest_checkpoint -Duse_libMPI" ../include ../time_manager ../fms \
../mpp ../mpp/include ../constants ../memutils
make
cat << EOF > input.nml
&checkpoint_nml day=1 /
EOF
mpirun -np 1 a.out
This returns
forrtl: severe (193): Run-Time Check Failure. The variable
'checkpoint_mod_mp_checkpoint_\$CHECKPOINT' is being used without being
defined
...
MPI: #10 0x400000000113ccd0 in CHECKPOINT_MOD::checkpoint(time=struct time_type { ... }) "checkpoint.F90":101
MPI: #11 0x400000000113d550 in checkpoint_test() "checkpoint.F90":127
...
That "variable" is actually a function return value and is correct.
The same code will work corrctly if you turn off error checking in the
compiler (i.e use mkmf.template.ia64 instead).
Sun, 14 Oct 2007
FMS: possible time manager bug
So here's a relevant code fragment (from checkpoint_mod ):
integer :: sec, min, hr, dy, mon, yr
type(time_type) :: time, last
call get_date( time-last, yr, mon, dy, hr, min, sec )
When I print the results of this, I get:
checkpoint time= 1982 Jan 01 12:00:00
checkpoint last= 1982 Jan 01 00:00:00
checkpoint diff= 1 1 1 12 0 0
checkpoint time= 1982 Jan 02 00:00:00
checkpoint last= 1982 Jan 01 00:00:00
checkpoint diff= 1 1 2 0 0 0
where the diff line is printing out yr, mon, dy, hr, min, sec .
I think the first diff ought to read 0 0 0 12 0 0 and the second ought
to read 0 0 1 0 0 0 .
What's going on? I find the time_manager_mod code too convoluted to
follow.
Sat, 13 Oct 2007
FMS: Online checkpointing
Checkpoint/restart or CPR is a mechanism to intervene in a running
job, to cause it to exit gracefully upon receiving a signal, and then
be able to restart and continue exactly from the state where it left
off. It's a useful feature that's recently taken on some urgency, as
we're finding the system idles when a large PE-count job is waiting to
be loaded, and Ops is waiting for that many PEs to empty.
Currently checkpoints are enabled in FRE, at the script level. Online
checkpointing refers to checkpoints at the code level, which can be
finer-grained: FRE can only intervene at intervals of a run segment,
which can be several months or years of model time.
To enable online checkpoints I've added a new module checkpoint_mod to
FMS. It's currently added on the omsk_vb branch of shared/platform ,
but perhaps should go elsewhere... shared/fms maybe?
The idea is that the main loop in coupler_main checks every coupling
timestep whether a checkpoint is triggered. The trigger is
site-specific and non-portable; it's thus enclosed in an ifdef . It
currently works only on the HPCS, and the shared code needs to be
compiled with -DGFDL_HPCS to turn it on.
The interval at which the model is checkpointable (yes, I know this
isn't a word) is specified in namelist checkpoint_nml . It should be an
exact multiple of the coupling timestep, obviously, but more subtly,
should be a multiple of any time-averaging interval specified in the
diag_table . Neither of these is currently enforced in the code... you
have to be sure to specify a valid checkpoint interval. We recommend
month=1 in checkpoint_nml .
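For instance, following the input.nml pattern from the test recipe
above (the month variable name is an assumption, by analogy with the
day variable used there):
cat << EOF > input.nml
&checkpoint_nml month=1 /
EOF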
The trigger at this time is the existence of a file that is created by
Ops when they want to signal to jobs to come down.
Initial tests show that there are problems with frerun . It still
seems to apply the wrong timestamp to some files.
Steps remaining:
- make sure
frerun can handle these unscheduled exits.
- restart a checkpointed job and make sure it passes the restart
test.
- CPR a production job to make sure
frepp can handle unpredictable
and uneven segments.
- add some more information to
stdout and stderr about how a
checkpoint was triggered.
It can be checked out using
cvs up -r omsk_vb coupler shared/platform
This is a branch tag and is still evolving. I will send out a static
tag and instructions when it's ready for wider testing.
Mon, 16 Jul 2007
FMS Revised data estimates
Model name | Restarts (GB/y) | History (GB/y) | Thruput (y/d) | #runs | Output (GB/d) | source
MOMp25 | 40 | 76 | | | | jwd mail 26 Jun 2007
M180 | 17 | 66 | | | | bw hurrell (not NARCCAP) run, see note
M90 | 4 | 16.5 | | | |
MOMp25+M90 | 44 | 92 | 2 | 2 | 547 |
C192 | 18 | 70 | 4 | 1 | 354 |
C384 | 72 | 280 | 2 | 1 | 708 |
Total data volume output per day is 1600 GB/d.
Minimum sustained bandwidth to stream this data back is 150 Mbps.
Associated spreadsheet is /home/vb/proposals/ornl/DataVolume.ods .
Wed, 11 Jul 2007
FMS: Data output estimates
Estimates are in GB for a model year:
Model name | Restarts | History | source
MOMp25 | 40 | 76 | jwd mail 26 Jun 2007
M180 | 17 | 91 | bw NARCCAP run, see note
M90 | 4 | 23 | scaled from M180
MOMp25+M90 | 44 | 99 | see above
C192 | 18 | 96 | scaled from M180 (1.06)
C384 | 72 | 192 | scaled from C192
- Source for M180 is /archive/bw/fms/memphis/narccap/m180_narccap_hist .
C192 implies 192x192x6 points; M180 implies 576x360.
- 2 job streams of MOMp25+M90 , one each of C192 and C384, running at 2
years a day imply a total data output of 664 GB/day (or 61 Mbit/s). This
is the sustained bandwidth requirement between ORNL and /archive . In
addition, we should probably require at least 10 days worth of
staging disk and 10 days worth of scratch: 14 TB. At 3 y/day it's all x1.5.
Thu, 7 Jun 2007
FRE: how FRE jobs get checkpointed
The design developed by Amy and Tim is that FRE scripts will
periodically check for the existence of a system file; and if it
exists, bring the model down gracefully, like a deflating Zeppelin.
The scripts resubmit themselves to the queues again.
Here is the relevant fragment of code from the frerun -generated
script:
#checkpoint -- if system requests jobs exit, resubmit this script and exit
if ( -f /home/gfdl/flags/fre.checkpoint.\$HOSTNAME || \
-f /home/gfdl/flags/fre.checkpoint.all || \
-f /home/gfdl/flags/jobs/fre.checkpoint.\$JOB_ID ) then
qsub \$scriptName
unset echo
echo exiting early by HPCS request, resubmitting...
Mail -s "job \$JOB_ID \$name has been checkpointed" \$USER <<END
Your FRE production job ( \$JOB_ID ) has been stopped and
resubmitted to the batch queue. It will be re-run by the operators
as soon as possible.
Job details:
\$name (run \$ireload, loop \$irun) running on \$HOST
Batch job stdout:
\$SGE_STDOUT_PATH
END
sleep 30
exit 5
endif
- Ops will create one of those
fre.checkpoint files to signal to FRE
scripts that they are asking for a shutdown: which might be of the
single HPCS node HOSTNAME , or the entire HPCS ("all "), or of only
the job JOB_ID .
- FRE then will
qsub itself, update any state files (such as
recording the current timestamp), and exit.
- Whether the resubmitted job goes to the top or bottom of the queues
is a matter of site policy, and outside the scope of FRE. However,
it's possible for Ops or SGE (the HPCS scheduler) to recognize
checkpointed jobs, so any policy is implementable. Site policy
will be designed to provide incentives for users to make their jobs
checkpointable.
- Since the mechanism is so simple, you could easily modify your
non-FRE script, or even model code, to check for the existence of
these files, and take appropriate action.
Tue, 29 May 2007
FRE: reducing queue wait time for frepp
A very simple change to FRE might be for the runscript to do its qsub
frepp at the beginning of the job (with -hold_jid \$JOB_ID ), rather
than at the end. See discussion below...
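In sketch form, with \$freppScript standing in for whatever frepp
invocation the runscript currently submits at the end:
#at the top of the frerun-generated runscript, rather than at the end:
qsub -hold_jid \$JOB_ID \$freppScript    #held until this run job completes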
Tue, 29 May 2007
FRE: breaking up FRE scripts into scriptlets.
Several times in the past, we have considered breaking up the
FRE-generated shell scripts into smaller scripts. A higher granularity
of scripts submitted to the scheduler would in principle allow a
tighter control of system load: visualize the queuing problem as a
Tetris puzzle with time on the Y axis, and processors on X.
A simple example of how this might be done would be to consider the
script generated by frerun as having 3 stages: data load ; parallel run ;
data offload . The load and offload stages move data between archive
(slow access) and scratch (fast access) and are essentially single-PE,
multiple-IOstream, large-memory. The run stage is high-PE-count and
runs directly to and from scratch storage. As the resource requirements
are different, we could consider splitting the runscript up into 3
jobs or scriptlets, where only the run scriptlet would hold down many
processors.
The implications of so configuring a job are several:
- scratch storage is required to be persistent: our current model of
data storage is based on scratch being a job-specific directory
created on local disk.
- There would be 3 times as many jobs on the system. On a system
dominated by queue wait times (as is the case now on the HPCS) this
is certainly a consideration.
- Total residence time of data on scratch would be from the beginning
of the first scriptlet to the end of the third.
There are two ways to create persistent scratch :
- one is to make your own subdirectory under
/vftmp/\$user instead of
the \$TMPDIR issued to you by SGE. There are two disadvantages: one
is that /vftmp/\$user is private to a node; all scriptlets seeking
to share an instance must execute on the same node. The second is
that the user becomes responsible for deleting the workspace,
unlike \$TMPDIR which is automatically scrubbed at the end of a job.
The residence time of data in scratch storage (GB-hours) must be
modeled and compared to the total available.
- Alternatively you create it in the "semi-permanent" storage areas
/work and /ptmp . (Current policy is that users cannot directly use
/ptmp ; its use is reserved for the caching archiver written by Tim
Yeager, called hsmfiles .) Semi-permanent storage is on CXFS, so using
it as scratch (where there may be many small writes to disk...) carries
the risk of creating a lot of CXFS metadata traffic.
We can model the behaviour of this by expressing three quantities: the
total time TT (hours), CPU residence time PT (CPU-hours) and scratch
data residence time DT (GB-hours) of the load/run/offload scriptlet
sequence. We cannot of course simultaneously minimize all three.
Broadly speaking, TT is a measure of an individual scientist's
satisfaction; PT is a measure of overall system health and efficient
use; and DT is constrained by the maximum disk size. DT need not be
minimized if it stays below DTMAX: but filling the disk is likely to
result in painful disruption in terms of lost jobs and recovery time.
Call the scriptlets L, R, and O; each has an associated queue wait
time Q, runtime T, processor count P and scratch space S.
The processor count for the L and O scriptlets is set to 2: the
minimum on the current HPCS.
When this is run as a single job:
- TT = QL + TL + TR + TO
- PT = P*(TL + TR + TO)
- DT = TL + S*(TR+TO)
When this is run as 3 scriptlets independently submitted to a
scheduler:
- TT = QL + TL + QR + TR + QO + TO
- PT = P*TR + 2*(TL + TO)
- DT = TL + S*(QR + TR + QO + TO)
While there is a clear advantage in PT for the second method
(especially when P is large) there is a disadvantage in the other two
measures of resource usage, which arises from the additional queue
wait times associated with the later jobs in the sequence, QR and QO.
SGE allows you to minimize QR and QO by submitting all three jobs at
once (rather than having each job submit the next), and declaring a
dependency (see the qsub manpage for the hold_jid flag). We submit the
jobs all at once in sequence, getting jobs j1 , j2 and j3 submitted as
follows:
qsub load_expt (qsub returns a job ID j1 on stdout )
qsub -hold_jid j1 run_expt (returns j2 )
qsub -hold_jid j2 offload_expt
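In a csh script the job IDs can be captured from qsub 's stdout
(assuming the usual SGE submission message, Your job NNNNNN ... ,
where the ID is the third word):
set j1 = `qsub load_expt | awk '{print \$3}'`
set j2 = `qsub -hold_jid \$j1 run_expt | awk '{print \$3}'`
qsub -hold_jid \$j2 offload_expt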
Jobs j2 and j3 are placed in a hold state until their dependencies are
fulfilled, but continue to advance in the queue. Thus, while queue
wait times may not go to zero, some fraction of QR and QO overlaps with
prior jobs in the sequence. Another limiting factor on reduction of
queue wait times is that the three jobs have to go on the same node if
scratch is local disk. Making scratch a shared disk is a significant
re-architecture of the HPCS, and should be approached with caution.
There is also an advantage in making TR very large: this can be
achieved by increasing the maximum job time beyond the current 10
hours.
We are currently exploring this method to see if we can empirically
get ranges of values for some of the numbers above, for typical FRE
production jobs.
It is to be noted that production jobs resubmit themselves, and with
careful design, this sequence of jobs with dependencies and overlaps
can be very long indeed. It can be extended to frepp as well, which is
a separate discussion.
Tue, 29 May 2007
FMS: use of dplace on the HPCS
On the newer installation of the OS on the HPCS, known as Propack5SP1 ,
there is a new version of the command dplace , which has been shown to
significantly improve performance of some codes. It is also unlikely
ever to degrade performance.
The way to tell if a node is running Propack5SP1 is to type cat
/etc/sgi-release :
ic9% cat /etc/sgi-release
SGI ProPack 5SP1 for Linux, Build 501r2-0703010508
ic5% cat /etc/sgi-release
SGI ProPack 5 for Linux, Build 500r1-0607180902
On Propack5SP1 , use the syntax
unsetenv MPI_DSM_DISTRIBUTE
mpirun -np \$npes dplace -o -r bl -s 1 -c 0-`expr \$npes - 1` foo.x
On Propack5 , use the syntax
unsetenv MPI_DSM_DISTRIBUTE
mpirun -np \$npes dplace -o -s 1 -c 0-`expr \$npes - 1` foo.x
... to run the program foo.x on \$npes processors.
There seems to be some complication due to the use of GFDL's mpirun
"wrapper": please follow the progression of the ticket #23330 on the
GFDL HelpDesk to see how this resolves.
Please note the unsetting of MPI_DSM_DISTRIBUTE : it appears that the
dplace arguments have no effect if this is set.
Mon, 28 May 2007
FRE: fremake bugfix for top-level Makefile failure
Several users had reported a sometimes failure from the library-based
compilation script from fremake . The symptom was an incomplete library
list on the ld line, and usually a failure to find any main program:
This turns out to be because the GNU Make automatic variable \$?
behaves somewhat differently from how I was interpreting it... using
\$^ instead seems to fix it:
\$(LD) \$^ \$(LDFLAGS) -o \$@
This has been updated on the branch nalanda_vb_arl of fremake .
Another minor bugfix shows up when using fremake -t foo : the
top-level Makefile was being created in \$HOME and not in \$rootdir/foo .
Also fixed on the same branch.
Thu, 17 May 2007
FRE: strange perl behaviour of \$!
In many places in FRE.pm we have error-checking of shell commands of
the form:
qx/something/;
croak "nevermore: \$!" if \$!;
... however, this doesn't seem to be a good idea, because \$! often
seems to be non-null even when there is no error.
So, I noticed qx/foo/ resolves to the value of stdout at the end of
command foo ... so I replaced the ones in fremake with
print STDERR qx/something/;
(I also noticed that in mkmf I have lines like
unlink \$object or die "\aERROR unlinking \$object: \$!\n";
... which works fine. Perhaps "\$! is non-null if error" holds, but not
"error if \$! is non-null "?)
Wed, 16 May 2007
FMS: cube sphere and lat lon AMIP code
Here is some version 3 FRE to express the cube sphere compilation...
<component name="fms" paths="shared" includeDir="\$FREROOT/c48_amip_test/src/shared/include">
<source versionControl="cvs" root="/home/fms/cvs">
<codeBase version="nalanda_2007_04">shared</codeBase>
<csh>cvs update -r nalanda_cube_vb shared/amip_interp shared/topography</csh>
</source>
<compile>
<cppDefs>-Duse_libMPI -Duse_netCDF</cppDefs>
<srcList>
/home/ck/nalanda/src/UPDATES/topography.F90
</srcList>
</compile>
</component>
<component name="atmos_phys" paths="atmos_param atmos_shared" requires="fms" includeDir="\$FREROOT/c48_amip_test/src/shared/include">
<source versionControl="cvs" root="/home/fms/cvs">
<codeBase version="nalanda_2007_04">atmos_shared atmos_param</codeBase>
<csh>
cvs update -r latlon2d_bw `/home/fms/bin/list_files_with_tag latlon2d_bw`
cvs update -r merge_rwh_latlon_bw_wfc `/home/fms/bin/list_files_with_tag merge_rwh_latlon_bw_wfc`
</csh>
</source>
<compile>
<srcList>
/home/ck/nalanda/src/UPDATES/interpolator.F90
/home/ck/nalanda/src/UPDATES/mg_drag.F90
</srcList>
</compile>
<compile target="debug">
<cppDefs/>
<mkmfTemplate>\$FREROOT/site/mkmf.debugtemplate.ia64</mkmfTemplate>
</compile>
</component>
<component name="atmos_dyn" requires="fms atmos_phys" paths="atmos_coupled
atmos_cubed_sphere/driver/coupled
atmos_cubed_sphere/model
atmos_cubed_sphere/tools">
<source versionControl="cvs" root="/home/fms/cvs">
<codeBase version="nalanda_2007_04">
atmos_coupled atmos_cubed_sphere/driver/coupled atmos_cubed_sphere/model atmos_cubed_sphere/tools
</codeBase>
<csh>
cvs update -r latlon2d_bw `/home/fms/bin/list_files_with_tag latlon2d_bw`
cvs update -r merge_rwh_latlon_bw_wfc `/home/fms/bin/list_files_with_tag merge_rwh_latlon_bw_wfc`
rm -f atmos_cubed_sphere/tools/pp
</csh>
</source>
<compile>
<cppDefs>-DSPMD -DZERO_ZS</cppDefs>
<srcList>
/home/ck/nalanda/src/UPDATES/atmos_model.F90
</srcList>
</compile>
<compile target="debug">
<cppDefs>-DSPMD -DZERO_ZS</cppDefs>
<mkmfTemplate>\$FREROOT/site/mkmf.debugtemplate.ia64</mkmfTemplate>
</compile>
</component>
<component name="ice" paths="ice_amip ice_param" requires="fms">
<source versionControl="cvs" root="/home/fms/cvs">
<codeBase version="nalanda_2007_04">ice_amip ice_param</codeBase>
</source>
<compile>
<srcList>
/home/ck/nalanda/src/UPDATES/ice_model.F90
</srcList>
</compile>
<compile target="debug">
<cppDefs/>
<mkmfTemplate>\$FREROOT/site/mkmf.debugtemplate.ia64</mkmfTemplate>
</compile>
</component>
<component name="land" paths="land_lad land_param" requires="fms">
<source versionControl="cvs" root="/home/fms/cvs">
<codeBase version="nalanda_2007_04">land_lad land_param</codeBase>
</source>
<compile>
<cppDefs>-DLAND_BND_TRACERS</cppDefs>
<srcList>
/home/ck/nalanda/src/UPDATES/land_model.F90
/home/ck/nalanda/src/UPDATES/land_properties.F90
/home/ck/nalanda/src/UPDATES/rivers.F90
/home/ck/nalanda/src/UPDATES/soil.F90
/home/ck/nalanda/src/UPDATES/vegetation.F90
/home/ck/nalanda/src/UPDATES/numerics.F90
/home/ck/nalanda/src/UPDATES/climap_albedo.F90
</srcList>
</compile>
<compile target="debug">
<cppDefs>-DLAND_BND_TRACERS</cppDefs>
<mkmfTemplate>\$FREROOT/site/mkmf.debugtemplate.ia64</mkmfTemplate>
</compile>
</component>
<component name="ocean" paths="ocean_amip" requires="fms">
<source versionControl="cvs" root="/home/fms/cvs">
<codeBase version="nalanda_2007_04">ocean_amip</codeBase>
</source>
<compile/>
<compile target="debug">
<cppDefs/>
<mkmfTemplate>\$FREROOT/site/mkmf.debugtemplate.ia64</mkmfTemplate>
</compile>
</component>
<component name="coupler" paths="coupler" requires="fms atmos_phys atmos_dyn ice land ocean">
<source versionControl="cvs" root="/home/fms/cvs">
<codeBase version="nalanda_2007_04"> coupler </codeBase>
<csh>
cvs update -r latlon2d_bw `/home/fms/bin/list_files_with_tag latlon2d_bw`
cvs update -r merge_rwh_latlon_bw_wfc `/home/fms/bin/list_files_with_tag merge_rwh_latlon_bw_wfc`
</csh>
</source>
<compile>
<srcList>
/home/ck/nalanda/src/UPDATES/coupler_main.F90
/home/ck/nalanda/src/UPDATES/flux_exchange.F90
/home/ck/nalanda/src/UPDATES/surface_flux.F90
</srcList>
</compile>
<compile target="debug">
<cppDefs/>
<mkmfTemplate>\$FREROOT/site/mkmf.debugtemplate.ia64</mkmfTemplate>
</compile>
</component>
As you can see, almost every component is updated, so you cannot share
libraries with the m45_am2p14 experiment, not even libfms.a !
The sooner we can apply testing to the latlon2d_bw and
merge_rwh_latlon_bw_wfc branches, the better... see
Task 119.
PS. Also note the line rm -f atmos_cubed_sphere/tools/pp under the
component atmos_dyn . The tools/ directory should be parallel with
atmos_cubed_sphere maybe?
Wed, 16 May 2007
FRE: cvs updates and the new fremake
I am having a problem with the new fremake . The cube-sphere is
currently running with two "update" tags, latlon2d_bw and
merge_rwh_latlon_bw_wfc , both of which span multiple components.
So if you're going to add a line under <source> like:
<csh>cvs update -r foo `list_files_with_tag foo`</csh>
... which component should it go under? Ideally it should only be
executed once, but the order of components is unpredictable
(it's ordered by keys %codeBase , and perl hash order is undefined).
Currently I've repeated it in all components, which is wasteful. A
change requires the ability to restrict list_files_with_tag to a
single component: currently it's invoked in src/ so it updates
everything under there. See Task 117.
Second: The <csh> tag above is quite inelegant (that whole
list_files_with_tag construct is annoying to the code aesthetics
mafia...)
I propose:
<component>
<source>
<codeBase version="nalanda">
<codeUpdate version="nalanda_2007_04"/>
<codeUpdate version="latlon2d_bw" type="delta"/>
</codeBase>
</source>
</component>
The first codeUpdate replaces entire modules (i.e. it's equal to cvs
update -r nalanda_2007_04 component ). The type="delta" says
that the second one is incremental:
cvs update -r latlon2d_bw `list_files_with_tag latlon2d_bw`
See Task 118.
Tue, 15 May 2007
FRE: invoking dplace from frerun
Currently the run command is something like
time -p mpirun -np \$npes a.out
There's a need to be able to customize this for that dplace stuff
Chris spoke about on 14 May 2007.
I am currently doing this by hand using an environment variable called
MPIRUN_EXEC_PREFIX and then using it on the mpirun line:
setenv MPIRUN_EXEC_PREFIX "dplace -r bl -s 1 -c 0-`expr \$npes - 1`"
...
time -p mpirun -np \$npes \$MPIRUN_EXEC_PREFIX a.out
MPIRUN_EXEC_PREFIX points to something that will be invoked by mpirun
and which in turn will invoke a.out . This is how you invoke many
performance profilers as well.
Task 116 requests that this behaviour be built into frerun .
Tue, 15 May 2007
FRE: frestatus redesign
As we go through and redesign the FRE scripts, the key issue is to
make the perl code itself site-independent, and retrieve
site-dependent stuff from fre.cshrc .
How do we do that for frestatus ? What it does, mainly, is to grep for
a status string within an output file.
I believe this can be achieved using two environment variables, to be
set in fre.cshrc :
- \$FRE_OUTPUT_FILE_REGEXP_LIST is a list of regular expressions that
match output filenames. For instance, the current output file for
experiment \$expt looks like \$ARCH/\$expt/*/ascii/stdout for frerun
and \$root/\$expt/exec/stdout for fremake . The new style of fremake
output files looks like \$FREROOT/stdout/compile_\$expt.csh.o* or
something. This file regexp will be a function of the system
queuing software.
- \$FRE_STATUS_STRING_REGEXP_LIST is a list of regular expressions
that match the status string you look for. For instance fremake
reports either the string ERROR: make failed or NOTE: make
successful . The status string is a function of the FRE software.
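A hypothetical fre.cshrc fragment illustrating the two variables (the
patterns are examples only, drawn from the cases above):
setenv FRE_OUTPUT_FILE_REGEXP_LIST '\$ARCH/\$expt/*/ascii/stdout \$FREROOT/stdout/compile_\$expt.csh.o*'
setenv FRE_STATUS_STRING_REGEXP_LIST 'ERROR: make failed|NOTE: make successful'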
Some pseudocode (which definitely does not work, but presents an
interface in the new style:-) is checked in on the nalanda_vb tag of
frestatus (14.1.4.1). The request for implementation is
Task 115.
Mon, 30 Apr 2007
FMS FRE: ensemble parallelization
Notes from an exchange with Rich Gudgel about parallelizing
post-processing: can some of this make its way into frepp ?
The question is about simultaneously managing output from a large
number of runs: typically ensembles. In this case it's using the EAKL
ensemble filter.
The first question is about where to store intermediate files. The
recommendation is to use /work :
Running mppnccombine on /ptmp (or actually, /work , because /ptmp
is intended for use by FRE itself...) has the same downsides as
running it in /archive . It generates many small disk writes on
CXFS which bogs things down on the filesystem.
You could copy your uncombined files to /work/rgg after your 200p
job exits, and then launch mppnccombine from a separate script.
But that second script should still copy to /vftmp , run
mppnccombine , then copy the output back to /work (or /archive ).
That's what ptmp -enabled FRE does. /work is underutilized at this
point (3TB used out of 19TB available) and everyone already has
their /work/user directory, so please use it! You can save DMF
traffic by using /work as storage for your "pre-mppnccombined
files".
The second question is about parallelizing mppnccombine . Rich wonders if:
I can utilize a method Ron Stouffer said is available where I can
use my existing requested processors (200) to distribute each
mppnccombine to the desired files (with 16 files per 20 cpus this
actually allows for a pretty reasonable usage of each cpu as long
as I can assign a cpu to each file).
But the answer is:
You cannot parallelize one mppnccombine , but what you could do is
to parallelize across your ensemble. (so, for your 10-member
ensemble, 10 jobs).
How to automate the process? One way of course is for FRE to become
"ensemble-aware", so this will be done by frepp . We have this on
the list of future requirements for the "new FRE", so stay tuned
(but not with bated breath...)
What you can do in the interim:
One mode of parallelism that's underexploited is qsub task
parallelism. (Do man qsub and look for the -t option).
Write a script to do one ensemble member, which is identified by some
shell variable, e.g \$member , so that if \$member=1 , you
process ensemble member 1, etc.
The script should set
set member = \$SGE_TASK_ID
Then launch 10 of these in parallel using qsub -t 1-10 script . You will
then have a 10-task job running, and qsub will set SGE_TASK_ID
differently in each task, so you'll process your whole ensemble.
You could further parallelize by month, if you wanted to get fancy:
@ member = ( \$SGE_TASK_ID - 1 ) / 12 + 1
@ month = ( \$SGE_TASK_ID - 1 ) % 12 + 1
and use qsub -t 1-120 script to launch a 120-way parallel job to do all
the months for all the ensemble members.
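Putting it together, a minimal task-parallel script might look like
this (the processing steps are illustrative):
#!/bin/csh -f
#submit with: qsub -t 1-120 script
@ member = ( \$SGE_TASK_ID - 1 ) / 12 + 1    #ensemble member, 1..10
@ month = ( \$SGE_TASK_ID - 1 ) % 12 + 1     #month, 1..12
echo processing ensemble member \$member month \$month
#...fetch this member/month's files, run mppnccombine, copy back...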
Mon, 30 Apr 2007
FRE: bugs in new FRE
- fremake -n turns on the nocheckout option, which means you never
enter the createSrc routine in FRE.pm . If you don't, then the hash
\$srcDir{\$expt} is not set, which causes createExec to fail.
- we previously had one instance of the perl variable
\$! returning an
error code, even though there was no error: this was in L309 of
FRE.pm , now commented out. I also got the same error (command
succeeded but \$! returns "No such file or directory") on L247. I
changed the error to a warning, which still isn't satisfactory, but
at least it keeps going...
- L398 substitutes instances of
\$FREROOT with \$rootdir{\$fre} , which
is incorrect... isn't it? At least there's no requirement that the
user set <directory type="root"> to FREROOT ...
And these two items are design issues and not really bugs, I think:
- New fremake does not understand shell variables like \$root ...
should it?
- frerun -t behaviour needs to be changed so old frerun can be used
with new fremake .
And this is a feature not a bug!
<pathNames> is used at compilation, not at checkout, so it's a
subnode of <compile> not <source> .
Tue, 24 Apr 2007
FMS: conventions for unit test programs
Unit test programs are now bundled with the module whose function they
test. They are usually an appendage at the bottom of the source file.
Thus, for module modname , we have the file modname.F90 , which has, at
the end:
#ifdef test_modname
program test
...
#endif
If unit tests are stored in their own file, the proposed convention
is:
- the unit test for module
modname_mod should be named test_modname.F90
and be enclosed in #ifdef test_modname...#endif
- if it needs a namelist, it should be
test_modname_nml
Zhi Liang will be implementing this in mpp .
Wed, 11 Apr 2007
FRE FMS: the modules
Here is a list of modules available in the current CVSROOT/modules
file, grouped by the component they might match:
- fms: shared ; may need a version of shared without shared/coupler
- atmos_phys: atmos_param_am2 , atmos_param_am3 , atmos_param_zetac
along with atmos_shared ; may need a module atmos_param_hs for the
Held-Suarez
- atmos_dyn: fv , bgrid , spectral , ...
fv should be given a module consisting of
atmos_fv_dynamics/driver/coupled atmos_fv_dynamics/model
atmos_fv_dynamics/tools atmos_coupled , similarly for the others
Tue, 10 Apr 2007
latex: tex4ht getting better!
I am now able to process images directly following instructions on
Gurari's Troubleshooting page: see section on direct processing of EPS
figures. The relevant lines in balaji.cfg are
\Configure{graphics*}
{eps}
{\Needs{"make -f \$TEXROOT/Makefile -r \csname Gin@base\endcsname.png"}% see \$TEXROOT/Makefile
\Picture[\csname Gin@base\endcsname.eps]{\csname Gin@base\endcsname.png class="graphics"}%
}
and the Makefile knows how to convert eps to png .
setenv TEX4HTENV \$TEXROOT/tex4ht.env
settles the issue of a common .env file. (which can also be done using
the -e argument to tex4ht and t4ht ).
Gurari's Q/A page is another useful page to look at. In general,
stuff is not easily linked up, but a google search on
site:http://www.cse.ohio-state.edu/~gurari foo seems to work...
Many things are fixed directly in \$TEXROOT/Makefile .
Thu, 5 Apr 2007
FRE: more on the TODO list
At present make clean does not work in the top-level directory.
One solution would be for the top-level Makefile to define:
clean:
make -f Makefile.fms clean
make -f Makefile.ocean clean
...
Wed, 4 Apr 2007
latex: hyperlatex
Wed, 4 Apr 2007
latex: tex4ht wrapper t4post
- The standard
dvips is producing small images. I attempted to scale
it up using -x 2000 with and without -E ... no fun. How to get
larger images? Also appears to be no way to correct this in CSS...
the <img> tag has height and width in units of pt .
- In
\$HOME/tex/bin I now have scripts to aid post-processing. After
running htlatex , run t4post in that directory... it puts all HTML
files in the right wrapper.
- CSS problems are mostly fixed by using the
NoFonts flag. I now have
a single file called \$HOME/css/tex4ht.css that may be fine-tuned.
I also downloaded hyperlatex, which might just be easier! (not sure).
Tue, 3 Apr 2007
latex: tex4ht progress
tex4ht looks like it ought to do whatever I want, ("highly
configurable", etc) and besides, the Debian port is run by Kapil
Paranjape! We've been corresponding about it, among other things.
The original doc from Eitan Gurari is somewhat impenetrable, but I
found some decent documentation elsewhere, which may be somewhat out
of date.
Problems:
- I'm having problems with my
\includegraphics{} EPS images.
- my PHP header and footer?
- CSS is generating the same for boldface and plain text.
By using a two-step process using dvips instead of Kapil's Debian
default dvipng I am now generating images from \includegraphics{} .
These lines are now enabled in \$HOME/tex/articles/testhtml/tex4ht.env
(why won't this work when I use \$HOME/.tex4ht.env instead?):
G.png
Gdvips -E -Ppdf -mode ibmvga -D 110 -f %%1 -pp %%2 > zz%%4.ps
Gconvert zz%%4.ps -trim +repage -density 110x110 -transparent '#FFFFFF' %%3
Grm zz%%4.ps
instead of these lines in /usr/share/texmf/tex4ht/tex4ht.env :
G.png
Gdvipng -q -mode ibmvga -D 110 -o %%3 -T tight -pp %%2 -bg 'Transparent' %%1
I think the best way might be to post-process the output from tex4ht
to embed the <body> portion of the output within our standard header
and footer.
By using the NoFonts option I seem to have reduced the number of CSS
lines to fix. Now I think the CSS file is somewhat static... should be
able to replace it with something.
Tue, 3 Apr 2007
fre: libraries for nalanda_2007_04?
F-group will be using the new FRE environment for testing
nalanda_2007_04 to make libraries. I am proposing that we set up the
libraries in /home/fms/... prior to release, and test that we can pass
RTS using those libraries.
Start with downloading your own instance of the FRE scripts:
\$HOME/fms/bin/fre_setup \$HOME/fre
source \$HOME/fre/site/fre.cshrc
(For now use absolute paths for the FREROOT directory argument...
fre_setup isn't dealing with relative paths correctly).
User fms will:
- identify the list of components that will need libraries: though
the release may begin with only
shared , we need to have a list of
the required components.
- make sure each required component has its own CVS module within
CVSROOT/modules .
- Set up three
<compile> nodes for each target: one with no target
and the production template, one with target="debug"
and the debug template, one with target="flt" and the
fltconsistency template.
- Writing the libraries should be part of the checklist for "moving
the
testing tag".
Liaisons will conduct runs using the precompiled libraries as far as
possible.
Mon, 2 Apr 2007
FRE: Notes and fixes
- fre_setup doesn't work correctly if its argument (which will become
FREROOT ) is a relative pathname.
- The compile script uses make instead of gmake . (Actually those are
synonyms, but it's been noted that if you want to re-alias make
(e.g. alias make 'make -j 8' ) you have to re-alias make and not gmake .)
- You might see error messages from perl that may say Bad file
descriptor in some odd instances: those seem to be some bug or
un-understood phenomenon at this point.
Mon, 2 Apr 2007
FRE: immediate TODO list
- fre_setup doesn't work correctly if its argument (which will become
FREROOT ) is a relative pathname. See /home/vb/tmp/fre/foo/fre_setup
for a fix.
- fre_setup checks out the testing tag of bin2 , etc! Is that OK?
Should be the HEAD tag, probably...
- does fremake break frestatus ? We need to settle the issue of where
job output goes, so that frestatus can use it.
- settle the issue of how fre.cshrc is to know where
experiment-specific stuff goes.
- get canonical XML for checkout and compile working again (both
writing the XML file, and verifying).
- do the top-level Makefile along with the top-level component
(actually the component should tell it when a load is required).
- Figure out how to suppress those Bad file descriptor messages.
- To be consistent with what went before, <mkmfTemplate> should
either have a file in a file attribute, or the template directly
loaded in between the start and end tags. Right now we have the
filename between the start and end tags.
Thu, 29 Mar 2007
tlemcen: configuring X for projectors
The safe configuration for projectors at this time seems to be a
1024x768 display. However, tlemcen is a laptop with a 1920x1200 native
configuration running NVidia graphics.
I found Christophe Troestler's page on laptop configuration to be a
useful resource describing how to set things up to project correctly
from a subwindow of a display that's larger than 1024x768.
The key steps are as follows:
- Run a "nested" X display that's a reasonable size for projection,
- Set a "panning domain" overlapping with the nested display and send
only that domain to the second screen (projector).
- Run your presentation full-screen within that nested display.
In greater detail:
- install Xnest: a package that's part of
XFree86 that runs a "nested
X server". (Installing for me is as simple as typing apt-get
install xnest ). Xnest will act like an independent X display,
running something "full-screen" in it will only occupy the Xnest
window of the real screen.
- always run your
Xnest window in a fixed location, e.g Xnest
-geometry 1024x768+0+0 , which will put it at the top left of your
real screen.
- It will probably be slightly offset from the origin (top-left
corner, nominally
(0,0) ) of the real screen: you can find the
actual location of the window by typing xwininfo and pointing at
your Xnest window. Near the end of the output you'll see the list
of corner coordinates. The first corner is the one closest to the
origin, something like +4+28 .
- set up your dual display to clone your screen to the secondary
output. In my case (
nvidia-xconfig ) that gives you lines in the
Screen section of your X configuration file (in my case,
/etc/X11/xorg.conf ) file that looks something like:
Option "TwinView" "True"
Option "TwinViewOrientation" "Clone"
Option "UseEdidFreqs" "True"
Option "MetaModes" "1920x1200, 1024x768 @1920x1200 +4+28"
- This says the second screen (projector) receives a
panning domain
within the main 1920x1200 screen, which is 1024x768 in size and
offset +4+28 from the origin. This panning domain happens to
coincide with your Xnest window. (There is plenty more information
about Nvidia X configuration if you really want it).
- Now any application running full-screen in your
Xnest window will
appear full-screen on the projection. You also have the rest of
your own screen to do stuff that won't be visible on the
projection. The Xnest usage doc describes how to do this. The way I
do it is:
% Xnest -geometry 1024x768+0+0 :1 &
% kpdf --display :1 /home/vb/tex/talks/fms/forum/20070328.pdf &
The first line starts up the Xnest window on display :1 , the second
line runs kpdf on that display.
Wed, 28 Mar 2007
emacs: the latex-beamer class
My first successful attempt at a talk using the latex beamer class!
the talk is at /home/vb/tex/talks/fms/forum/20070328.tex and the PDF
in the same directory.
Pretty classy look!
Wed, 28 Mar 2007
FRE FMS: MI Team meeting 28 March 2007
Today's MI Team talk is posted to the web.
Wed, 28 Mar 2007
FRE: Amy's FAQ on the dual-run capability
- How do I start a dual run for a new experiment or as a
reproducibility test for an old experiment?
- How do I rerun just a subset of a previous experiment? Will I be
charged for the hours it takes to run?
- How do I tell what hosts and cpusets my previous experiments ran on?
- How do I compare the results of the two runs?
- For which experiments should I perform dual runs?
How do I start a dual run for a new experiment or as a reproducibility
test for an old experiment?
To dual-run an experiment, set up the original experiment just as
before: call frerun as usual and submit your job script. To create a
second instance of the experiment as a dual run, invoke
/home/fms/bin/frerun again with the same FRE schema file and
experiment, but this time with the -u option. This will create a new
runscript that differs from the original runscript in the following
ways:
1. The output will be written to a subdirectory of the original experiment
with an integer digit for a name, i.e.,
/archive/\$USER/...
experiment_name/
|-- 1/
| |-- ascii/
| |-- history/
| `-- restart/
|-- ascii/
|-- history/
|-- pp/
`-- restart/
2. No post-processing will be performed on the dual run.
3. Dual-run jobs will be submitted with the qsub options -A repro -l repro .
Jobs submitted with these options will not be charged to allocated
time, and will show up with a '2' in the STATE column of qa :
% qa -u fjz
572364 fjz dc CM2.1U-D4_1861-2000-Aerosol_Q5 ic.a 60 - 600/
- - 03/23 qw
572370 fjz dc CM2.1U-D4_1861-2000-Aerosol_Q4 ic.a 60 - 600/
- - 03/23 2qw
You can add other qsub -l options, e.g. 4700 , bx2 , 3700 , or ic5 , to
direct dual-run jobs (or any jobs) to specific nodes. qconf -scl
reports the list of available values for -l .
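For example (the XML file and experiment name are placeholders):
/home/fms/bin/frerun -u -x myexpts.xml CM2.1U_Control   #creates the dual-run runscript
qsub -l ic5 <dual-run runscript>                        #optionally direct it to a node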
How do I rerun just a subset of a previous experiment? Will I be
charged for the hours it takes to run?
Use frerun -u to generate a dual-run runscript for your experiment, just
as above. You can create as many distinct dual-run runscripts as you like.
Then edit the runscript to change the initial conditions file to point to
the appropriate file from the original experiment's restart directory.
The job will need to be stopped manually, or you can use frepriority
to adjust the number of queue allocations.
If there is a need for this to be more automated, for example by providing a
command line option to frerun to set the initial conditions file for the
dual-run, this may be implemented later.
Since the runscript was created as a dual-run runscript, it will use
the special qsub flags to indicate a dual-run, and you will not be
charged for the hours.
How do I tell what hosts and cpusets my previous experiments ran on?
This information is contained within the \$archivedir/expt/ascii/stdout file
of your experiment. There is an option to frestatus which will help you parse
this file by producing a summary from the tail sheet information of your job
submissions:
frestatus -rlx \$rtsxml \$expt
This produces output such as:
NOTE: Natural end-of-script for
/home/wh/ccsp/ipcc_ar4_volc/scripts/CM2.1U-D4_1PctTo4X_J2b.
jobname CM2.1U-D4_1PctTo4X_J2b
jobnumber 459604
qsub_time Tue Jan 23 22:34:43 2007
start_time Wed Jan 24 02:25:03 2007
end_time Wed Jan 24 10:21:33 2007
failed 0
cpuset 8-37,68-87,238-243,246-249
hostname ic3
NOTE: Natural end-of-script for
/home/wh/ccsp/ipcc_ar4_volc/scripts/CM2.1U-D4_1PctTo4X_J2b.
jobname CM2.1U-D4_1PctTo4X_J2b
jobnumber 460641
qsub_time Wed Jan 24 10:21:33 2007
start_time Wed Jan 24 15:58:09 2007
end_time Wed Jan 24 23:50:59 2007
failed 0
cpuset 54-57,250-253,330-337,362-371,394-403,430-443,450-451,468-475
hostname ic6
...
Note that these cpusets are logical cpusets, not physical cpusets.
How do I compare the results of the two runs?
To compare the results of two runs, you can use frecheck . This will
check all available matching output files from your original and dual
runs.
To compare just two specific restart files, run the resdiff utility.
ls -1 \$archivedir/expt/restart/19830101.cpio
\$archivedir/expt/1/restart/19830101.cpio | resdiff
The resdiff utility is located in /home/fms/bin , and a usage message is
available with resdiff -h . resdiff uses cmp to compare the files
within multiple cpio archives. There is another utility histdiff which
does a similar thing but uses Remik's nccmp tool. This allows for more
detailed comparisons; see histdiff -h for more details.
Currently there is not a more automated way to test the results, but
automated mechanisms may be implemented in the future.
For which experiments should I perform dual runs?
Dual runs should be done at the discretion of the users and scientific
groups. Any runs which are deemed sufficiently important, or which have
shown anomalous behaviour such as unexplained failures, may be worth
rerunning. A history of all jobs on the system, and where they ran, is
available going back more or less indefinitely.
Wed, 28 Mar 2007
FMS: April 2007 patch to nalanda
There has been a flurry of activity since the nalanda release to
integrate some of the irreversible changes introduced by the distribution
of the shortwave flux field into streams; and the changes to
incorporate the new conservation checks on water (and other quantities
soon).
The changes are serious enough to warrant a new patch as soon as
possible. Please follow the wiki page on post-nalanda tag moves as we
begin the patch.
Wed, 21 Mar 2007
FMS: using histx
Will Cooke's notes on using histx for performance analysis:
Here are some notes I made on getting performance data out of our
models. I'd leave the -d out of the histx part of the regression. i.e.
do everything.
If you have time, you could try this to see if there's a sore thumb sticking
out in your model runs.
Will
Method for using histx
See GFDL Wiki page on Altix profiling for detailed info.
For profiling a small portion of the code.
Add
call enable_histx
call disable_histx
around the code you want to profile.
If you're timing the entire code, start here.
Add
source /home/gcs/histx_1.4b/setup.csh
to your XML setup (I'm assuming csh/tcsh is being used). The /home/gcs
references should become /home/fms/ ... sooner rather than later.
Add
-L/home/gcs/histx_1.4b/lib -lhistx
to the LIBS variable of <mkmfTemplate> (which gets tacked on to LDFLAGS ).
The code must be compiled with -g also.
fremake the code as normal; you need to relink at a minimum.
Add
<regression name="prof">
<run days="8" months="0" npes="15" runTimePerJob="00:60:00"
histx="-l -d -f"/>
</regression>
to your XML experiment.
-l gives line information.
-d disables the timing info until it hits call enable_histx . Remove
this if you're timing the entire code.
-f is needed for parallel code.
frerun -r prof -x ...
run the script in a cpuset environment (IC4, IC5 or qsub)
Go to the archive directory and explode the hi*.cpio file.
Run
source /home/gcs/histx_1.4b/setup.csh
iprep hi.* > my_profile.out
to get combined statistics on your section of code.
Wed, 21 Mar 2007
FMS FRE: the site configuration
Amy Langenhorst proposes a method to organize the site configuration
files so that users can easily have their own copy of the site config
file.
The idea is to have a script, say fre_config , which will check out
(or update) a version of the site files into your directory of choice:
fre_config -r nalanda -o \$HOME/fre
will create, under \$HOME/fre , the following tree:
bin/ -- contains mkmf, fremake, frerun, etc.
lib/ -- contains FRE.pm and so on
site/ -- site-configured csh setup, mkmf templates for local architectures
(site could be renamed loc for "local", or etc but site might be
clearer... the distinction I'm making is that this is usually not
local to one host, but to one site. See my .cshrc , it also sources
.cshrc,site , .cshrc.`uname` , .cshrc.`hostname` , ...)
site contains the file fre.cshrc , which will set up other paths, see below.
fre_config will also modify the checked out copy of fre.cshrc to set
\$FREROOT to the root point of this checkout (\$HOME/fre in this
example...), add \$FREROOT/bin to \$PATH , \$FREROOT/lib to \$PERLLIB , and
so on. fre_config will then source this file. (Actually it will print
a line asking you to source this file...)
Users can make mods to their fre.cshrc , and then re-source it.
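In other words, the expected workflow is something like:
fre_config -r nalanda -o \$HOME/fre
source \$HOME/fre/site/fre.cshrc
with the second line setting \$FREROOT , \$PATH and \$PERLLIB as
described above.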
Currently fre.cshrc sets
CVSROOT: delete as this will now move into <checkout root="">
MKMF: path to mkmf
VERSIONS: saves exact per-file cvs checkout info
CVSREDO: redo the checkout
INCLUDE: path to some include files like netCDF
LISTPATHS: path_names file generator, basically a wrapper for find
BATCH_COMPILE: qsub with some defaults
We should go over this list again: e.g BATCH_RUN needs to be added.
And I think CVSREDO etc need to be omitted... does anyone ever use them?
They simply clutter the scripts at this point. You redo based on the
canonical FRE file.
Tue, 20 Mar 2007
FRE: the dual-run capability
There are several possibilities for how to handle it.
- no changes to FRE schema.
frerun -u , which currently works on
regression tests to set up a unique run, will now also work on
production. It will restart from the <initCond> file and not
perform any post-processing. In short, exactly as though you
created a new experiment that inherited exactly an existing one,
and turned off the <postProcess> node.
- The future evolution of FRE schema will move towards having a
realization attribute that identifies members of an ensemble. The
FRE DB will have the capability to return the exact difference
in configuration between two realizations, e.g a difference between
<initCond> files ("initial-condition ensemble") or between settings
of some input parameter ("perturbed-parameter ensemble").
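A purely illustrative sketch of what the realization attribute might
look like (none of this syntax is settled yet):
<experiment name="CM2.1U_Control" realization="2">
<initCond>/archive/\$USER/CM2.1U_Control/restart/19830101.cpio</initCond>
</experiment>
The FRE DB could then report that realization 2 differs from
realization 1 only in its <initCond> file.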
Mon, 19 Mar 2007
emacs: muse, changed extension to "emu"
Acknowledges the "emacs" part... see last two lines of emacs-muse
customization, below.
(setq muse-file-extension "emu") ; cooler than muse
(add-hook 'find-file-hooks 'muse-mode-maybe)
Mon, 19 Mar 2007
FMS FRE: notes on new repository policies and structures
As we are starting to add the feature of precompiled component
libraries, it's time to take a fresh look at how to structure the
repository.
The component-based FRE schema that is currently being built allows
components to be built at various levels of granularity. We
principally aim to provide the standard model components: atmosphere,
ocean, land, sea ice. A list of such components might include:
- atmos_dyn
  - FV
  - FV cube sphere
  - BGRID
  - spectral
  - zetac
  - amip
  - EBM
  - SCM
  - shallow water gridpoint
  - shallow water spectral
  - shallow water cube sphere
  - null
- atmos_phys
  - am2
  - am3
  - simple_physics
  - null(?)
- ocean
  - mom4p0
  - mom4p0_static_om3 (plus other static configs)
  - him
  - him static configs
  - mom4p1
  - mixed layer
  - amip
  - null
- land
- ice
- infrastructure
  - fms with libMPI
  - fms with libSMA
  - fms with nocomm
In addition, we might choose to package items at a higher level of
granularity: e.g groups of atmospheric column physics, or ocean
bio-geochemistry packages as solo components. This would require each
to have at least some solo test configurations. Perhaps one useful
functional definition of a "supported" model component in this setup
would be the existence of a test program for running it.
It also occurs to me that we could package stuff up at a lower level
of granularity: complete coupled models. Currently one way to retrieve
a model configuration that is known to pass the RTS is to retrieve the
RTS itself, using cvs co -r nalanda rts . The executables that are
supported under this scheme could also be delivered in the same way as
libraries for the components.
A package under the proposed design is a component of a recognized model
configuration that is...
Proposal:
Parallel to /home/fms/cvs is /home/fms/components .
- Under which is a long list of components.
- under which is a directory for each release ((city) and also
  (city)_yyyy_mm ). The component and release axes are orthogonal: we
  choose to put component outside because the new approach has
  version as an attribute of <codeBase> .
- under which we have directories src (checked-out source), lib
  (library), include (headers and modfiles), data (input files), xml
  (fre), exec (for application-level components).
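For one hypothetical component (mom4p0) and one release (nalanda),
the tree would look like:
/home/fms/components/
`-- mom4p0/
    `-- nalanda/
        |-- src/
        |-- lib/
        |-- include/
        |-- data/
        |-- xml/
        `-- exec/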
Start with some standard ones and then keep extending? too much work
for liaisons?
Thu, 15 Mar 2007
emacs: muse issues
- I discovered a problem with the RDF files produced by muse-journal
  when there are SGML tags in the early text of the entry (which gets
  stuck in the RDF <description> ). Fixed it by customizing
  muse-journal-rdf-entry-template ...
- still having a problem figuring out how to get complete directory
  trees from the muse directory published... tried the method in the
  example muse-init.el (with the weird-looking ,@(...) lisp
  expressions...) which didn't work, nor does :include yet... For the
  moment I am going to create separate projects for each directory.
Thu, 15 Mar 2007
LaTeX: latex2html is still quite broken
- \usepackage{html} puts AucTeX into PDFLatex mode... quite annoying!
- latex2html picks up latexrc.tex if it's in the local directory...
  appears to ignore \$TEXINPUTS ... is that normal?
- when latexrc is loaded the document is completely haywire...
- I need to fix this to update the FMS Manual!!!
Tue, 13 Mar 2007
FRE: "componentizing" the checkout and compilation
FRE is now set to work with a new version of fremake that can link to
existing libraries and headers, and skip checkout and compilation of
the components you wish to use as a "black box".
Sun, 11 Mar 2007
web: using CSS to create cobweb and www versions of the same page
We have many pages (the FMS and FRE pages are a good example...)
where some information is to be made public (www ) and some is to be
GFDL-internal (cobweb ). Here is a way to use CSS
to achieve this. (CSS stylesheets are the standard way to control how
HTML is actually rendered on screen).
Add this to your HTML header:
<style type="text/css">
.cobweb { display:none; } /* for pages shared between cobweb and www */
</style>
This says, in CSS, that any item of class cobweb is not to be
displayed.
Then, you write your webpages with the GFDL-internal information
included, but enclose the information you don't want displayed on the
external web in:
<span class="cobweb">
GFDL-internal information ...
</span>
<span class="cobweb">
Create your webpage in /home/vb/external_html and use symbolic links
to list the file also in /home/vb/internal_html . You will see the
pages rendered differently in a browser when you invoke the cobweb and
www URLs of the same file: the GFDL-internal information will be
invisible in the www page.
As an exercise in CSS, see if you can figure out how I disabled the
standard drop-down menus and the font-resizer macro in the top right
using CSS...
</span>
Fri, 9 Mar 2007
tlemcen: kscd autoplay
Many "modern" desktop environments seem to take a page out of the
Gates playbook and try to guess what you want under most
circumstances. When it's right you usually don't notice, but when it's
wrong it can be quite a problem.
In this particular instance, the issue is that when you insert an
audio CD, KDE automatically launches kscd , the CD player. You might
not want that; and besides, on my current installation on tlemcen ,
kscd is not connecting to audio (it plays, but silently).
After some frustrating attempts to discover where in the KDE config it
says to autoplay CDs using kscd , I gave up and simply disabled it. The
first attempt,
apt-get remove kscd
can't be recommended, as it also wants to remove kubuntu-desktop .
I settled on
mv /usr/bin/kscd /usr/bin/kscd.DISABLED
which causes the KDE daemon to pop up an error message, but no matter.
Further research on the kubuntu forums shows that the file in question
is ~/.kde/share/config/medianotifierrc , which in turn says that for
audio CDs, start up
/usr/share/apps/konqueror/servicemenus/audiocd_play.desktop , which, at
the bottom, says Exec=kscd .
So, take your pick, kill kscd , delete the audiocd line from
medianotifierrc , or point audiocd_play.desktop to something other than kscd .
Mon, 5 Mar 2007
netCDF: padding the file header
A problem often encountered when making changes to netCDF files is
that, by default, the header is written with exactly the length
required to hold the metadata defined when the file was first
created. Any subsequent attempt to change the header information using
NF_REDEF (which is used for example by ncatted) involves mass data
motion, as the library moves the entire body of actual data down in
order to make a few bytes more space in the header portion.
One way to get around this problem is by using the two-underscore
version of the header completion routine NF__ENDDEF. This version
has extra arguments to create padding after the header (H_MINFREE ) and
after the static data (V_MINFREE ). To quote the NF__ENDDEF user guide:
The minfree parameters allow one to control costs of future calls to
nc_redef , nc_enddef by requesting that minfree bytes be available at
the end of the section.
Here's how the call looks in Fortran:
INTEGER FUNCTION NF__ENDDEF(INTEGER NCID, INTEGER H_MINFREE, INTEGER V_ALIGN,
INTEGER V_MINFREE, INTEGER R_ALIGN)
NCID
NetCDF ID, from a previous call to NF_OPEN or NF_CREATE.
H_MINFREE
Sets the pad at the end of the "header" section.
V_ALIGN
Controls the alignment of the beginning of the data section for
fixed size variables.
V_MINFREE
Sets the pad at the end of the data section for fixed size variables.
R_ALIGN
Controls the alignment of the beginning of the data section for
variables which have an unlimited dimension (record variables).
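A minimal Fortran sketch, assuming we want 16 kB of padding after the
header, with the remaining arguments at what I believe are their
defaults; the error handling is schematic:
      PROGRAM PAD
      INCLUDE 'netcdf.inc'
      INTEGER STATUS, NCID
      STATUS = NF_CREATE('padded.nc', NF_CLOBBER, NCID)
C     ... define dimensions, variables and attributes here ...
C     end define mode, reserving 16384 bytes of header padding
      STATUS = NF__ENDDEF(NCID, 16384, 4, 0, 4)
      IF (STATUS .NE. NF_NOERR) PRINT *, NF_STRERROR(STATUS)
      STATUS = NF_CLOSE(NCID)
      END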
Implications in FMS? We can pad datasets as they are created, by
using the right flavour of NF__ENDDEF in mpp_io . This is controlled
right now by an obscure variable called header_buffer_val , which has
to be set non-zero to turn on this feature.
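In FRE terms that would be something like (the namelist name and value
are from memory, so verify against mpp_io ):
<namelist name="mpp_io_nml">
 header_buffer_val = 16384
</namelist>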
As the ncatted manpage shows, if you have an existing file without the
padding, you can add it using the --hdr_pad argument. This argument
also exists in ncks and ncrename. I interpret the documentation to say
that any subsequent processing of the file will preserve the
padding. If you use it all up, and still continue to add stuff to the
header, you'll just fall back to the old slow behaviour of moving all
the data down in the file.
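For instance (filenames illustrative):
ncks --hdr_pad 10000 in.nc out.nc
rewrites in.nc as out.nc with 10 kB of free space reserved in the
header.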
Once we fix this in mpp_io there is still an issue of code that
doesn't pass through mpp_io . I've seen at least one reference to
NF_ENDDEF with one underscore (which means no padding) in the drifters
package.
Sat, 3 Mar 2007
parallel computing: a new distributed OS?
Limbo programming language and the
Inferno OS
Sat, 3 Mar 2007
emacs: setting muse project headers on a per-project basis
How do I set muse-html-header on a per-file or per-project basis?
This doesn't work...
(add-hook 'muse-before-publish-hook
'(lambda ()
;; (setq muse-html-header
(message
(concat (file-name-directory (buffer-file-name (current-buffer))) "header.html"))
;; (setq muse-html-footer
(message
(concat (file-name-directory (buffer-file-name (current-buffer))) "footer.html"))
(setq muse-html-table-attributes "class=\"muse-table\"")))
Instead I have
(setq muse-html-header "header.html")
(setq muse-html-footer "footer.html")
which seems to create confusion if I'm editing multiple projects
simultaneously...
It also appears I need to set
(setq muse-journal-rdf-base-url "http://cobweb.gfdl.noaa.gov/~vb/weblogs/")
... which it appears you can also set in the project-alist as :header ...
but how to make this come out different per-project?
So, did I finally figure out how to set the project-alist ?
(setq muse-project-alist
'(("gfdlweb" ; GFDL public web
("~/muse/gfdlweb" :default "index")
(:base "html":path "~/external_html"
:header "~/muse/gfdlweb/header.html"
:footer "~/muse/gfdlweb/footer.html"))
;; (:base "pdf" :path "~/external_html/pdf"))
("weblog" ; weblog on cobweb
("~/muse/cobweb/weblogs" :default "journal")
(:base "journal-html" :path "~/internal_html/weblogs"
:header "~/muse/cobweb/weblogs/header.html")
:footer "~/muse/cobweb/weblogs/footer.html")
(:base "journal-rdf" :path "~/internal_html/weblogs"
:base-url "http://cobweb.gfdl.noaa.gov/~vb/weblogs/"))
("cobweb" ; GFDL internal web
("~/muse/cobweb" :default "index")
(:base "html" :path "~/internal_html"
:exclude "weblogs"
:header "~/muse/cobweb/header.html"
:footer "~/muse/cobweb/footer.html"))
;; (:base "pdf" :path "~/internal_html/pdf"))
("web" ; Princeton web
("~/muse/web" :default "index")
(:base "html" :path "~/public_html"))))
;; (:base "pdf" :path "~/public_html/pdf"))))
Seems to work!
Sat, 3 Mar 2007
emacs: htmlize
The muse documentation says that using htmlize we are able to process
<src lang="foo"> but it doesn't seem to work, perhaps
because we have an older version.
To invoke htmlize you seem to need
(add-to-list 'load-path
"/usr/share/emacs/site-lisp/emacs-wiki/contrib")
(require 'htmlize)
htmlize produces nice-looking output, but by default it's a complete
HTML file, with header, style info, and body. Need to figure out how
to embed it within muse .
Tue, 27 Feb 2007
emacs: muse web doc doesn't quite match my version
I now have a working setup and some homepages on cobweb and gfdl.
Still need to find out if the princeton web will accept php...
other minor tweaks are necessary.
One issue is that muse-el from ubuntu dapper is not the latest on the
muse-mode website... however the version found there does not install
cleanly on dapper , and besides, does not validate my muse.
What about magpierss ? I've fixed header.html so it points to
../magpierss instead...
Mon, 26 Feb 2007
emacs: muse musings
Ok, so I seem to have a working muse setup, I now have
- directory
gfdlweb for publishing to the GFDL web in
the external_html directory
- directory
cobweb for publishing to the
GFDL internal web in the internal_html directory
- directory
web for publishing to the Princeton web in
the public_html directory
An additional directory cobweb/weblogs is published using journal-html
and journal-rdf styles (journal-rss appears to have a bug). This is
encoded in .emacs.tlemcen as follows:
(setq muse-project-alist
'(("gfdlweb" ; GFDL public web
("~/muse/gfdlweb" :default "index")
(:base "html" :path "~/external_html"))
;; (:base "pdf" :path "~/external_html/pdf"))
("weblog" ; weblog on cobweb
("~/muse/cobweb/weblogs" :default "journal")
(:base "journal-html" :path "~/internal_html/weblogs")
(:base "journal-rdf" :path "~/internal_html/weblogs"))
("cobweb" ; GFDL internal web
("~/muse/cobweb" :default "index")
(:base "html" :path "~/internal_html" :exclude "weblogs"))
;; (:base "pdf" :path "~/internal_html/pdf"))
("web" ; Princeton web
("~/muse/web" :default "index")
(:base "html" :path "~/public_html"))))
;; (:base "pdf" :path "~/public_html/pdf"))))
Order appears important: weblog must precede cobweb , above.
Thu, 22 Feb 2007
FRE: Changes to FRE
Amy and I are proposing some <a href="{url}071114.html">changes
to FRE</a> in response to some of the most requested features:
namely, avoiding compiling the model components where you do not
expect to modify the source, and second, the ability to compile
multiple experiments from the same source.
The first of these involves certain changes to FRE syntax, and also,
unfortunately, certain changes to the repository, both of which are
explained in this entry. A decision has to be made as to whether to do
these now.
The major change to FRE syntax involves a reordering of the XML node
tree, so that <component> is now a high-level tag. All
the operations are now organized by component (as they are already for
post-processing). The key advantage of the current proposal is that
now checkout and compile instructions are organized by component as
well: this means component developers testing within a coupled model
configuration need only checkout and compile the component they are
interested in, and link to pre-compiled libraries for the other
components.
<component name="fms" paths="shared">
<compile>
<cppDefs>-DSPMD -Duse_libMPI -Duse_netCDF -Duse_shared_pointers -Duse_SGI_GSM</cppDefs>
</compile>
</component>
<component name="atmos_phys" paths="atmos_param" requires="fms">
<compile>
<cppDefs></cppDefs>
</compile>
</component>
<component name="atmos_dyn" paths="atmos_coupled atmos_fv_dynamics
atmos_shared" requires="fms atmos_phys">
<compile>
<cppDefs>-DSPMD -Duse_shared_pointers -Duse_SGI_GSM</cppDefs>
</compile>
</component>
<component name="ice" paths="ice_amip ice_param" requires="fms">
<compile>
<cppDefs></cppDefs>
</compile>
</component>
<component name="land" paths="land_lad land_param" requires="fms">
<compile>
<cppDefs>-DLAND_BND_TRACERS</cppDefs>
</compile>
</component>
<component name="ocean" paths="ocean_amip" requires="fms">
<compile>
<cppDefs></cppDefs>
</compile>
</component>
<component name="coupler" paths="coupler" requires="ocean land ice atmos_dyn fms">
<compile>
<cppDefs>-DLAND_BND_TRACERS</cppDefs>
</compile>
</component>
Each <component> now has its own
<cppDefs> , as well as <mkmfTemplate> ,
etc if desired.
Please note the following advantages:
- CPP macros are only applied to the component where they are relevant;
- For debugging one component, you could potentially compile all
other components at optimization and only this one with
-O0 -g .
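For instance, a hedged sketch (the template path is hypothetical, and
the template itself would carry -O0 -g in its compiler flags): to
debug only the atmospheric physics, give that component its own
template and leave the rest on the optimized default:
<component name="atmos_phys" paths="atmos_param" requires="fms">
<compile>
<cppDefs></cppDefs>
<mkmfTemplate>/home/\$USER/mkmf.template.debug</mkmfTemplate>
</compile>
</component>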
An even more useful feature is that the compilation of components can
be skipped entirely, by pointing to an existing component library.
Just as the <executable> tag allows one now to skip
compilation by invoking an existing executable, the
<library> tag will allow one to skip the compilation of
an existing component. For instance
<component name="ocean" paths="ocean_mom4" requires="fms">
<library path="/home/fms/lib/nalanda/libmom4.a" include="/home/fms/lib/nalanda/include/mom4"/>
<compile>
<cppDefs></cppDefs>
</compile>
</component>
will entirely skip compiling the ocean component, but instead invoke
the library /home/fms/lib/nalanda/libmom4.a at the linker
stage. The ability to perform selective compilation, skipping
especially the shared code, but many other components as
well, is probably the most desired feature in FRE. (Along with
multiple compiles from a single source, on which more later...)
One complication that arises is the include attribute of the
library. This is an attribute that is required for components
coded in F90/F95. F90 compilers store module information in a
.mod file (and I haven't ever figured out why compiler writers
can't just bundle this information into the .o file). The
.mod files will be required in order to process use
statements in higher-level modules.
The include directory can be correctly processed using a
-I flag, but the current setup does not apply this flag to
.f or .f90 files.
One possibility we have often considered is to rename all files in our
repository to .F90 . This can be done without losing the CVS
history of the file, but if you have an existing checked out
.f90 file and you attempt to update it, the update will
likely fail. This is true even for pre-existing tags.
So, question for liaisons: how mad would people around the lab get, if
along with the nalanda release, we did mass file renaming?
A second, unrelated change to the repository arises from an error in
structure noted in the course of the FRE rewrite. There are modules in
the coupler directory called atmos_ocean_fluxes and
coupler_types . These are =use=d by the component
models, ocean and so on. However the coupler is part of the
superstructure and is supposed to sit above the models in the
component hierarchy, and thus get compiled later. I'd like to
propose moving these two modules to a new shared/coupler
directory.
Tue, 6 Feb 2007
grids: Gridspec status
The files in
/archive/z1l/test_xgrid/tripolar1DXregular2.5Dx2D contain
examples of a mosaic consisting of a tripolar grid, a cube-sphere grid
mosaic (and the exchange grid between them?).
Sun, 14 Jan 2007
FRE: notes on the FRE rewrite
The redesign of FRE involves both a refactoring and modularization of
the code, and an evolution of the XML syntax.
Code restructuring
A first cut at the restructured code is seen in
/home/vb/src/perl , with tools fresrc ,
fremake , etc using the module FRE.pm . Site defaults
are in the file /home/vb/src/perl/site/fre.cshrc , but this isn't
quite properly configured yet to accept overrides from a setup
tag in the FRE.
FRE.pm is object-oriented: each FRE XML file is unpacked into
a new object called a fre . In most instances (in fact all, so
far) the script using FRE.pm will only create one
fre . But in principle, one could imagine a tool using several
fre s: for example, to allow inheritance across FRE files.
Within a fre , each experiment is unpacked into a hash. The
key of the hash is the experiment name: the value
of the hash is the expt node. The experiment name is the base
key for all hashes: thus, even in a script spanning multiple
=fre=s, we require that experiment names be unique. (We could
relax this requirement if needed by constructing a unique key from the
concatenation of the fre name and the experiment name, i.e
the key of the expt hash.)
All the information below the experiment node is maintained in the
node that's returned as the value of the expt hash. The great
advantage of XML::LibXML is that there's no need to unpack
the XML hierarchy very deeply: you create unique keys at some high
level, and query for the rest.
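A hypothetical usage sketch (the constructor and accessor names are
illustrative, not the actual FRE.pm interface):
use lib '/home/vb/src/perl';
use FRE;
my \$fre  = FRE->new('mymodel.xml');        # one fre object per XML file
my \$expt = \$fre->experiment('somename');   # the expt node, keyed by name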
Changes to FRE file syntax
The principal elements of the new syntax involve one major
restructuring: everything under the experiment tag is now
configured by component . Most of the other changes are new
and more general synonyms for existing elements, e.g the cvs
node is now replaced by a checkout node, whose type
attribute can have values cvs , svn , etc.
The component tag now appears only around the
compile node. (It already appeared as an element inside the
postProcess node...) Each component can have its own version
of compile , with its own cppDefs ,
mkmfTemplate , srcList , and csh tags. The
output of compilation of one component is an object library (currently
static only, i.e .a not .so ). The final stage of
compilation does the linking of libraries as a separate step.
A complication arising from F90 header files (.mod files) is
that components can only be compiled in a certain order. In general,
child components must be compiled prior to the parents. The
infrastructure component is a "universal child" and the
superstructure is the top-level parent, to be compiled last.
Since FMS has a relatively flat structure, we only need to designate
the infrastructure and the superstructure, which we do using the
role attribute. In future, we'll supply a requires
attribute, which will specify dependencies. In fact I'll go do that
now...
Thu, 21 Dec 2006
ESMF: Iredell's proposal: ~/doc/ms/gfdl/pilot3a.doc for pilot proj
- does not mention FV cubed sphere?
- timeline for ocean model component of hurricane (III?) If timeline
is right, could GUOM be a contender alongside HYCOM?
- post-processor component or service?
Mon, 20 Nov 2006
Curator: Talks on Curator
The <a href="talks/curator.pdf">first talk</a> I can remember giving
on Curator is my first proposal that this would be a natural
development from ESMF, at the <a
href="http://www.esmf.ucar.edu/main_site/meeting_summaries/mtg_0305_commmtg.html">
2nd ESMF Community Meeting</a> in Princeton, May 2003.
<a href="talks/apan2004b.pdf">Talk</a> given at <a
href="http://apan.net/meetings/honolulu2004/">APAN eScience
Workshop</a>, January 2004: describes how Curator flows naturally from
current practice in climate modeling.
<a href="talks/esp2004.pdf">First presented</a> to <a
href="http://go-essp.gfdl.noaa.gov/">GO-ESSP</a> (then called ESP)
community, at <a
href="http://data1.gfdl.noaa.gov/~hap/go-essp/meetings/06_08_04/agenda_presentations.html">
ESP meeting</a> in Princeton, June 2004.
<a href="talks/curator_gridmeta2005.pdf">
First talk</a> post-award, at <a
href="http://www.esmf.ucar.edu/main_site/meeting_summaries/mtg_0507_commmtg.html">
ESMF Community Meeting</a> in Cambridge, MA, July 2005. Also
introduces the grid metadata.
<a href="talks/curator_gridmeta2005.pdf">
Similar talk</a> to <a
href="http://www.cisl.ucar.edu/dir/CAS2K5/index.html">CAS Workshop</a> in
Annecy, September 2005.
<a href="talks/curator-prism2005.pdf">
Introducing Curator to PRISM</a> at <a
href="http://www.prism.enes.org/news_meetings/meetings/CommunityMtg2005/minutes.php">
First PRISM Community Meeting</a>, Toulouse, November 2005. Also a <a
href="talks/prism2005.pdf">keynote address</a> at the same meeting
introduces the use cases.
Overview of <a href="talks/metadata.pdf">ESMF-ESC metadata
activities</a> given to PRISM <a
href="http://www.prism.enes.org/news_meetings/meetings/MetadataMtg_May2006/minutes.php">
metadata meeting</a> in Exeter, May 2006.
This talk to an NSF <a
href="http://www.sdsc.edu/PMaC/GeoScience_Workshop/"> workshop on
Petascale computing in the Geosciences</a> in San Diego, April 2006;
presents <a href="talks/petageo.pdf"> Curator as part of an integrated
approach to the petascale</a> looking at models, data, and multi-model
campaigns.
Talks at the <a
href="http://data1.gfdl.noaa.gov/~ck/go-essp/presentations/06_19_06/agenda_presentations.html">
GO-ESSP Meeting</a> in Livermore CA, June 2006, covered the <a
href="talks/esp2006.pdf"> outlook for IPCC AR5 and beyond</a>, as well
as a detailed look at the <a href="talks/gridmeta2006.pdf"> draft grid
metadata standard</a>.
More outreach beyond climate modeling: <a
href="talks/modest7c-balaji-esmf.pdf"> this talk</a> was solicited by
an astrophysics community applying frameworks to stellar dynamics
models, at the <a
href="http://www.manybody.org/modest/Workshops/modest-7c.html">
MODEST-7c</a> workshop in Philadelphia, September 2006. <a
href="talks/crrc-balaji.pdf"> Another talk</a> later that month to a
community interested in real-time response to coastal disasters, at
the <a href="http://www.crrc.unh.edu/fall_institute/"> CRRC Fall
Institute</a> in Durham NH, September 2006.
The <a
href="http://www.earthsystemcurator.org/index.php?option=com_content&task=view&id=30&Itemid=65">
first ESC Meeting</a> attained <a
href="http://hotitems.oar.noaa.gov/storyPrint_org.php?sid=3759">
some notoriety</a>.
Fri, 17 Feb 2006
parallel computing: FV runs Held-Suarez
Modifications to make Held-Suarez solo driver for FV work with the new
code. This is the experiment HSfvd in idealized.xml .
atmos_nudge.f90 which is now in
atmos_fv_dynamics/driver/coupled needs a new home, so that
driver/solo can also use it.
The version in atmos/shared seems to be "dead"... it's not in
any module and I think Bruce has killed it. Perhaps the version in
atmos_fv_dynamics/driver/coupled could replace it?
Otherwise it could go in atmos_fv_dynamics/tools ?
shared/data_override and shared/time_interp need to
be included in the CVS module fms_fv_dynamics_solo .
driver/solo/atmosphere.f90 modified and renamed to
atmosphere.F90
driver/solo/fv_phys.F90 modified
fv_pack modified to publish two more variables needed by solo
driver.
atmosphere.F90 , fv_phys and fv_pack are now
tagged lima_20060217_vb .
Thu, 9 Feb 2006
FMS: FV core gets testing tag
Ok, it's going into testing ... fingers crossed.
mpp_pset.F90 did not work on Origin: it's now been fixed. (And it even
works!) Changes:
- the use_SGI_GSM flag now can be turned on for Origin.
- if use_SGI_GSM is on, mpp_pset_init asserts that
  SMA_GLOBAL_ALLOC must be on.
- mpp_translate_remote_ptr does nothing on Irix...
  SMA_GLOBAL_ALLOC means no translation required.
There was one place where a Cray pointer was being passed to an
integer(POINTER_KIND) , which MIPSpro doesn't like.
Now we have
#ifdef sgi_mipspro
real :: dummy
pointer( ptr, dummy )
#else
integer(POINTER_KIND), intent(in) :: ptr
#endif
instead.
Also may need to correct fv_physics and
atmosphere.f90 ... looking into it.
Thu, 9 Feb 2006
SciDAC proposal notes
SciDAC proposal notes:
At the current time (2005-2015) the principal mode of advancement in
climate modeling is by the study of a process across many
models: multi-model ensembles, where we achieve many independent
realizations of a simulation to construct a PDF. This involves very
large datasets (Tb-Pb) created at sites distributed around the world,
which must be analyzed on a common footing:
- Petabyte-scale storage by itself is achievable, but delivery over
the network is a problem. Need a fresh look at compression and
analysis of large data stored on a network of distributed servers.
- since storage isn't an issue we can look at both lossy and lossless
compression; the original data is still archived.
- simple compression techniques that aren't aware of the content of
the bits can be improved upon: 1) where FP numbers can be
identified, specialized compression can be applied (dynamic range of
exponent bits much smaller than mantissa bits) 2) knowing the file
contents to be gridded physical fields, multi-grid (or PCA or
wavelets) or other methods can be applied to store the dataset as an
overlay of several files whose size scales inversely with
wavenumber. "Domain-aware compression".
- AMR and nested models: explore the extension of techniques to
complex grid mosaics.
- techniques for manipulation of remote data, expressing and applying
sophisticated computation server-side;
Prerequisites:
- standards for describing complex grid mosaics, development of
regridding algorithms on mosaics.
- federation of climate data archives around the world.
Some work underway in other funded projects on the prerequisites, but
that work isn't complete. I'm not sure whether to provide these as
linked efforts or base work on this grant.
Wed, 1 Feb 2006
FMS: FV core ready for testing?
resuming the FV weblog entry... will try and patch the HTML over to
the wiki.
The new FV core is ready to be introduced into the testing
code stream. All required changes are within the directory
atmos_fv_dynamics , plus a new file in the shared/mpp
directory, mpp_pset.F90 . PSET stands for Persistent
Shared-memory Execution Thread and is the implementation of
shared-memory on Altix and Origin (so far). I'll be giving a lunchtime
seminar on PSETs in about a month, if you're interested in how it's
done.
All files are tagged lima_20060131_vb .
Instructions for moving the testing tag:
In shared/mpp , checkout mpp_pset.F90 and apply
testing tag.
In atmos_fv_dynamics many files will not be in the release.
These files will disappear:
model/benergy.f90
model/cd_core.F90
model/d2a3d.F90
model/drymadj.f90
model/geo_map.F90
model/geop_d.F90
model/geopk.f90
model/mapz_module.f90
model/p_var.f90
model/pkez.f90
model/polavg.f90
model/te_map.F90
tools/read_fv_rst.F90
tools/upper.F90
tools/write_fv_rst.F90
These files will remain:
driver/coupled/atmos_nudge.f90
model/dyn_core.F90
model/ecmfft.f90
model/fill_module.f90
model/fv_dynamics.F90
model/fv_pack.F90
model/pft_module.F90
model/shr_kind_mod.f90
model/sw_core.F90
model/tracer_2d.F90
model/update_fv_phys.F90
tools/age_of_air.F90
tools/fv_diagnostics.F90
tools/getmax.F90
tools/gmean.F90
tools/init_dry_atm.F90
tools/init_sw_ic.F90
tools/mod_comm.F90
tools/par_vecsum.F90
tools/pmaxmin.F90
tools/pv_module.F90
tools/set_eta.f90
tools/timingModule.F90
These files are new:
model/fv_arrays.F90
model/fv_arrays.h
model/fv_point.inc
model/mapz_module.F90
tools/fv_restart.F90
These files need to be renamed from .f90 to .F90 :
driver/coupled/atmosphere.f90
driver/coupled/fv_physics.f90
model/tp_core.f90
Note that model/mapz_module was already changed from
.f90 to .F90 between lima and memphis, in the
code I inherited...
The easiest way to get the testing tag on the right files, I think is
this:
cd atmos_fv_dynamics
cvs tag -d testing
cvs update -r lima_20060131_vb
cvs tag testing
... but there will still be the issue of the files whose names need to
be changed.
Compiling and running:
The code requires one set of flags for reproducing Lima answers, and
another set for the new code, which will become the standard version
shortly, after the usual stringent tests, climate runs and so on. I
am going to call the new version the Memphis version in this document,
even though it isn't officially sanctioned yet.
Running the lima version:
To run the Lima version, use the following flags in the
cppDefs XML tag:
-DSPMD -Duse_libMPI -Duse_netCDF -DUSE_LIMA
This is supposed to reproduce answers against any current run using
FV, but I've only tested it for m45_am2p13 , and only on the
Altix.
You need to set fv_core_nml as follows:
<namelist name="fv_core_nml">
nlon=144, mlat=90, nlev=24, ncnst=4,
consv_te = 0.7, layout=1,\$npes
</namelist>
(If you're running a concurrent coupled model, the value of
layout(2) of course is no longer \$npes but whatever
atmos_npes is...)
Running the memphis version:
Your running the new version is not a requirement for Memphis
testing, as we don't have official reference runs yet. However, if
you are curious about the shared-memory stuff, here's how to use it.
To run the new version on Altix, use:
-DSPMD -Duse_libMPI -Duse_netCDF -Duse_shared_pointers -Duse_SGI_GSM
The numbers specified in the layout argument of fv_core_nml
specify the PSET count and the MPI count. Typically, I set the MPI
count to 15 or 30, and let it pick the PSET count from \$npes.
For example
<namelist name="fv_core_nml">
nlon=144, mlat=90, nlev=24, ncnst=4,
consv_te = 0.7, layout=0,15
</namelist>
can be run on 1,2,3 or 6 threads, using 15, 30, 45 or 90 PEs.
The reason for choosing 2, 3 or 6 threads but not 4 or 5 is that
thread-parallelism is mostly applied in the k or j
direction within the FV core, and along i in the AM physics.
So I chose numbers that divide nlon , mlat/15 , and
nlev exactly (here 144, 6 and 24, whose common divisors are 1, 2, 3 and 6).
The size of the physics window is set in atmosphere_nml . The
code currently requires that the physics window divide the 2D domain
decomposition exactly. For instance, in the example above the 6x15
distribution yields a 2D domain decomposition for physics that's 24x6.
It's best to pick one that divides 24x6... that will also work for 3
threads (48x6) or 2 threads (72x6). For instance:
<namelist name="atmosphere_nml">
physics_window = 24,1
</namelist>
Setting physics_window to (0,0) yields a window that
fills the whole domain, so that the window loop only loops once;
that's been set as the default.
Here is a reference run for m45_am2p13 using the memphis
version... as I said, this isn't officially sanctioned yet. Answers
match on any thread count, of course only if -fltconsistency
is used.
<reference restart="/archive/vb/fms/lima_vb/rts/ia64/
m45_am2p13_shpbase/1x0m8d_30pe/restart/19820109.cpio"/>
Tue, 8 Nov 2005
FMS: FV core mod_comm changes
Transformation of mod_comm to 2D: it's now written so that mp_init
alone is called by the ypelist (one PE per latitude band). Every other
routine can be called by the whole pelist, but PEs not in ypelist will
exit the routine immediately. How?
- added variable no_mod_comm, default TRUE, set FALSE at the top of
mp_init. Every routine other than mp_init has as its first line
if( no_mod_comm )return
mp_init is now called from fv_arrays_init, not fv_init:
!initialize mod_comm on the ypelist
call mpp_declare_pelist(ypelist)
if( allocator )then
call mpp_set_current_pelist(ypelist, commID=commID)
call mp_init( nx, ny, nz, commID )
end if
call mpp_set_current_pelist(pelist)
fv_arrays_allocate is eliminated, this is now done by fv_arrays_init.
Public variables of mod_comm that must be set correctly outside
mod_comm: gid, numpro, numcpu, yfirst, ylast. The last 4 in fv_init,
yfirst/last are for prints, numpro/numcpu are for n_zonal.
Moved n_zonal into fv_arrays_init: no external dependence
numpro/numcpu
Eliminated yfirst/ylast prints from fv_init, but they still need to be
initialized internally, so y_decomp is still called.
gid is now == mpp_pe, does anyone require it to be 0 on master?
numpro/numcpu now silenced. (there were unused references to them in
fv_dynamics, eliminated)
Need to add layout to fv_core_nml
In this version all mp_* calls only work on shared arrays?
added two new shared arrays: penorth in fv_dynamics, cymax in tracer_2d
Possible problems routines: par_vecsum.F90 calls mp_sum1d, replace
with mpp_sum?: callers gmean, mapz_module
fv_restart calls mp_bcst_* on non-shared scalars/arrays: replace with
mpp_broadcast?
getmax calls mp_reduce_max, replace with mpp_max? called by timingModule.F90
Tue, 25 Oct 2005
FMS: FV core, minor changes
Some name changes and reorganization are necessary in fv_arrays_mod :
fv_arrays_allocate (plural) to be merged into fv_arrays_init
fv_array_allocate to generate address without communication (since
fv_stack is already a shared stack). Instead, call fv_array_check on
it so that you get a noop when #ifndef debug_shared_pointers
Remove the len argument to fv_array_allocate , instead make that a
module global, and add a new routine fv_stack_reset which is called
once per timestep, from fv_dynamics .
Thu, 20 Oct 2005
FMS: FV core, pointer shortcomings
It turns out that the use-associated pointer cannot directly be
applied to a Cray pointee. The test code shown here will fail to
inherit the pointer... however when you define the pointer pp
and assign the value p to it (see commented lines), it works:
module test_p
implicit none
private
integer(8), public :: p
public :: make_a
contains
subroutine make_a
real, allocatable, save :: a(:)
allocate( a(100) )
call random_number(a)
p = loc(a)
end subroutine make_a
end module test_p
program test
use test_p, only: p, make_a
real :: b(100)
pointer(p,b)
! pointer(pp,b)
! pp = p
call make_a
print *, b
end program test
Wed, 19 Oct 2005
FMS: FV core works
Ok, now the lima_vb code without the sharedptr flag exactly matches
the shpbase code:
ic1 9:35pm> dmget
/archive/vb/fms/lima_vb/rts/ia64/m45_am2p13_shpbase/1x0m8d_15pe/restart/19820109.cpio
/archive/vb/fms/lima_vb/rts/ia64//m45_am2p13_lima_vb/1x0m8d_15pe4/restart/19820109.cpio
ic1 10:01pm> ls -1
/archive/vb/fms/lima_vb/rts/ia64/m45_am2p13_shpbase/1x0m8d_15pe/restart/19820109.cpio
/archive/vb/fms/lima_vb/rts/ia64//m45_am2p13_lima_vb/1x0m8d_15pe4/restart/19820109.cpio | resdiff
193408 blocks
193408 blocks
/// /archive/vb/fms/lima_vb/rts/ia64//m45_am2p13_lima_vb/1x0m8d_15pe4/restart/19820109.cpio
\\\\\\ /archive/vb/fms/lima_vb/rts/ia64/m45_am2p13_shpbase/1x0m8d_15pe/restart/19820109.cpio
Comparing atmos_coupled.res.nc...
Comparing coupler.res...
Comparing fv_rst.res...
Comparing fv_srf_wnd.res...
Comparing ice_model.res.nc...
Comparing mg_drag.res.nc...
Comparing ocean_model.res.nc...
Comparing physics_driver.res.nc...
Comparing radiation_driver.res.nc...
Comparing radiative_gases.res.nc...
Comparing soil.res.nc...
Comparing strat_cloud.res.nc...
Comparing vegetation.res.nc...
Note that this was produced with the 1x0m8d_15pe4 script.
The 1x0m8d_15pe5 script is now testing it with the shared pointers
turned on, but a single thread.
Executable is in ~/fms/lima_vb/rts/ia64/m45_am2p13_lima_vb/shp
compiled with
fvmk -DSPMD -Duse_libMPI -Duse_netCDF -Duse_shared_pointers -Ddebug_shared_pointers
Tue, 18 Oct 2005
FMS: FV core, fv_domain closed
Working now after Gerardo's help and a few other minor bugfixes
Next is to correct Will's read_fv_rst references to fv_domain .
Close off fv_domain .
Tue, 11 Oct 2005
FMS: FV core, debugger problem
trouble debugging in totalview... no symbol table seems to be created
for read_fv_rst , which is not in a module.
Cannot simply make it into a module, because then there is use-
circularity between fv_pack and this module.
Better is to create a new module fv_restart_mod , containing
read_fv_restart , write_fv_restart , fv_restart .
This module should be used/called from atmosphere_init/end ,
not fv_init/end . (right after fv_init and right
before fv_end )
Sun, 9 Oct 2005
FMS: FV core, almost there
Ok the bulk of the code changes look complete.
Next, pause and take stock; see if M45 runs ok on 1x15 and 1x30.
All experiments are in the table below: each is run 8d at 1x15 and
1x30, and 4x2d at 1x15.
<table summary="Regression test table for FV SHP validation" border>
<caption>
Experiments to validate shared pointers based on the
m45_am2p13 RTS. Hover on the column header for additional
info. Hover on the experiment to get the archive directory (that you
can pass to frecheck -c , for instance).
</caption>
<tr>
<th>Name</th>
<th title="relative to /home/vb/fms: XML file is in .../rts/am2.xml">Root</th>
<th title="applied as an update relative to lima">Tag</th>
<th title="also -Duse_libMPI -Duse_netCDF on all">CPP flags</th>
<th>Comments</th>
<th>Status</th>
<tr>
<td title="/archive/vb/fms/lima/rts/ia64">lima</td>
<td class="code">lima</td>
<td>lima</td>
<td class="code">-DSPMD</td>
<td>Baseline from lima</td>
<td title="passes RTS">ok</td>
<tr>
<td title="not used for frecheck">lima_vb</td>
<td class="code">lima_vb</td>
<td>lima_vb (branch tag: <b>unstable!</b>)</td>
<td>various</td>
<td>branch code used for quick testing</td>
<td>ok</td>
<tr>
<td title="/archive/vb/fms/testing/rts">testing</td>
<td class="code">testing</td>
<td>testing</td>
<td class="code">-DSPMD</td>
<td>Baseline from testing</td>
<td title="passes RTS; matches lima">ok</td>
<tr>
<td title="/archive/vb/fms/lima_vb/rts/ia64">shpbase_lima</td>
<td class="code">lima_vb</td>
<td>lima_shpbase_vb</td>
<td class="code">-DSPMD -DUSE_LIMA</td>
<td>Baseline merged code from kkg, matching lima</td>
<td title="passes RTS; matches lima">ok</td>
<tr>
<td title="/archive/vb/fms/lima_vb/rts/ia64">shpbase</td>
<td class="code">lima_vb</td>
<td>lima_shpbase_vb</td>
<td class="code">-DSPMD</td>
<td>Baseline merged code from kkg, matching lima_sjl</td>
<td title="passes RTS">ok</td>
<tr>
<td>shpdevel</td>
<td class="code">lima_vb</td>
<td>lima_shpbase_vb</td>
<td class="code">-DSPMD -Duse_shared_pointers -Ddebug_shared_pointers</td>
<td>Baseline from testing</td>
<td>nil</td>
</table>
SPMD: we could leave it in place and call the mp_ routines with the pe
subset? If they all call, it's going to be a problem in some places.
The 4D array transfers have an OMP loop that could become thread-parallel.
Sat, 8 Oct 2005
FMS: FV code flags
The code as it stands compiles with a variety of switches: I've tried
-DSPMD ("original")
-DSPMD -Duse_shared_pointers
" " ("new code compiles without shared pointers")
-Duse_shared_pointers (" the target: replace SPMD calls")
Add the following settings to your .cshrc :
# main fv source directory
set fvsrc = ~/fms/lima_vb/rts/ia64/m45_am2p13_lima_vb/src
set fvdir = \$fvsrc/atmos_fv_dynamics/model
# exec directory without use_shared_pointers
set fvnoshpx = ~/fms/lima_vb/shp/no_shared_pointers
# exec directory with use_shared_pointers
set fvshpx = ~/fms/lima_vb/shp/use_shared_pointers
# pathnames file
set fvpaths = ~/fms/lima_vb/shp/path_names
# alias for making, pass CPPDEFS in args
alias fvmk 'mkmf -t ~/chepauk/mkmf.template.chepauk -c"\!*" \
-a \$fvsrc \$fvpaths /usr/include/netcdf shared/mpp/include'
Then you can compile any of the above by using fvmk and
make :
fvmk -DSPMD -Duse_shared_pointers
make
Some routines are yet to be parallelized; the original list is:
List of 39 OMP-parallel routines: init_dry_atm pmaxmin
geopk get_height_given_pressure get_pressure_given_height
pv_entropy pkez tracer_mass fv_init Ga_Get4d_i4 hydro_eq
geop_d Ga_Put4d_i4 Ga_Put4d_r4 te_map benergy fv_dynamics
drymadj add_tracers fv_diag mp_reduce_max update_fv_phys
omp_start get_bottom_mass p_var Ga_Get4d_r8 vort_d cd_core
read_fv_rst geo_map Ga_Put4d_r8 fv_restart p_energy
age_of_air maxmin_global Ga_Get4d_r4 tracer_2d_lima
compute_vdot_gradp d2a3d pmaxming
Next, pmaxmin.F90 : pmaxming is easy (copying a halo array into a naked
array).
pmaxmin dimensions things (im,jt) where jt is JxK . We need to
calculate the division... add a routine fv_array_limits to fv_arrays .
Still need to do the mpi_reduce ... mpp_max , min .
prt_maxmin_local can be parallelized also, but isn't in the original.
pv_module.F90 : OK
te_map has the =ixj=1,jp= loop... use fv_array_limits to calculate the
loop limits on these... this needs n_zonal from fv_pack to be
calculated.
Modified fv_array and fv_pack
cd_core :
update_fv_phys : updated argument lists to remove u_dt /etc... isn't it
better to have them still in the arglist, and mpp_malloc them in
update_phys_up ?
mapz_module.F90 completed: te and dz are
actually reuses of the shared arrays ua and va . Also
ps_bp is directly use-associated in the fv_arrays.h
method in geo_map and te_map . Both of these are only
called by fv_dynamics .
There is also an erroneous reference to the variable gid outside
#ifdef SPMD . Have initialized =gid=0 #ifndef SPMD= in
mapz_module :
fv_dynamics.F90 : now along with ps_bp , we also need
to use-associate the *_phys shared arrays.
Thu, 6 Oct 2005
FMS: FV details
mapz_module: stick with the argument list for now. But it's been shown
that specifying start and end indices in module procedures can force
copies... may need to change the way arguments are passed, or switch
to use association.
routine te_map :
Lots of SPMD message-passing to clean up.
Need to figure out what pem/tte0/hs are in the calling routine
fv_dynamics , required to be parallel.
the 2000 loop is odd... not parallelized. Everyone executes over whole
space...!
Too many switches between OMP k -loops and j -loops, could be cleaned
up.
2 j -loops are split by a call to par_vecsum at L687
k -loop at L751 is kmap,km ... I changed it to max(ksp,kmap),kep !!
should be OK, no?
Not parallelizing loop at L807
geo_map:
not parallelizing 2000 and 4000 loops
Similarly in routine pkez, L2506 loop is deferred parallelization, but
the comment there gives a hint as to how to fix it.
benergy stalled: te/dz were local arrays in the old (Lima) version,
but now are arguments. It also appears that fv_dynamics calls benergy
passing ua/va here... why?
Also ps_bp is one shared array that is use-associated while the rest
are args... why?
Wed, 5 Oct 2005
FMS: FV details (minor)
Just following the makefiles path won't do...
init_dry_atm: changed POLVANI read to mpp_open
did NOT duplicate changes to USE_LIMA version
fv_physics: still need to do windows logic properly, take from 1D2D
fv_diagnostics:
atmosphere.f90:
get_bottom_mass and wind, should use assumed shape arrays and offsets
ip,jp
these routines are called with unshared arguments, shouldn't be
parallelized
(they were called outside the original calling tree, that's why they
show up as orphaned)
read_fv_rst.F90: did not touch the I/O bit, Will has rewritten it,
merge
corrected routines read_fv_rst and add_tracers
write_fv_rst.F90: corrected how arrays are acquired, MERGE IO from Will.
maxmin_global is NOT parallelized, uses unshared arrays.
mapz_module.F90: complicated interactions with fv_dynamics, save for tomorrow.
Tue, 4 Oct 2005
FMS: FV core, added fv_array_check
Added a routine fv_array_check, which verifies if an array is a shared
array. The check is only performed #ifdef debug_shared_pointers. This
is needed because the check requires an mpp_sync... it should not
be used in production.
This call is added in the preamble, where we expect shared arrays to
be passed in through the argument list. (e.g d2a3d).
Current syntax requires Cray pointers (actually the LOC function).
This should perhaps be cleaned up later.
fv_pack.F90: finished (both with and without USE_LIMA)
use mpp_malloc for local (auto) shared arrays
use fv_array_check for shared arrays through the argument list
use #include "fv_arrays.h" for use-associated arrays
still need to replace SPMD
next in compilation sequence is update_fv_phys
update_fv_phys.F90:
add {t,q,u,v}_dt to fv_arrays.h and fv_arrays.F90
use the include method... argument list changed to eliminate _dt
shared arrays
What is du_s? There is a recv but no corresponding send
this routine uses beglon/endlon/beglat/endlat, which is redundant with
is/ie/js/je, probably should be replaced: scope for error/mismatch.
Or add error check in fv_init.
Note use of mpp_sync in this routine: this is because we need to sync
inbetween an OMP k-loop and a j-loop.
tp_core.f90 needs to become .F90! currently uses special command in
path_names file, which for some reason is not used in the makefile.
I edited the Makefile by hand!
Thu, 29 Sep 2005
FMS: FV core shared pointers
init_dry_atm: arrays are use-associated
replace read(61)+scatter with parallel read
pmaxmin: looking to see if all instances of the main array are shared
arrays... exceptions:
fv_diag:529(a2)
fv_diag:770(age)
- - - - - - -
=Annotated tree of OpenMP-containing subroutines=
get_bottom_mass in file driver/coupled/atmosphere.f90:397: (no callers, no calls)
tracer_mass in file model/fv_pack.F90:1965: (no callers, no calls)
add_tracers in file tools/read_fv_rst.F90:443: (no callers, no calls)
fv_init in file model/fv_pack.F90:294:
array args: none
calls: fv_arrays_init
set_fv_geom
pft_init
fftfax
fv_arrays_allocate
tm_set_tracer_profile
fv_restart
is called by: atmosphere_init
fv_restart in file model/fv_pack.F90:819:
array args: none
calls: init_sw_ic
set_eta
init_dry_atm
read_fv_rst
check_eta
d2a3d
is called by: fv_init
read_fv_rst in file tools/read_fv_rst.F90:3:
array args: none [1 omp directive - OK]
calls: set_eta
get_number_tracers
get_tracer_indices
get_tracer_names
set_tracer_profile
pmaxmin
pmaxming
p_var
d2a3d [with km=1]
is called by: fv_restart:901
init_dry_atm in file tools/init_dry_atm.F90:478:
array args: none
calls: p_var
jet2d_symm
hydro_eq
d2a3d [with km=1]
is called by: fv_restart
hydro_eq in file tools/init_dry_atm.F90:652:
array args: use-associated
calls: pmaxming
is called by: init_dry_atm
fv_diag in file tools/fv_diagnostics.F90:279:
array args: none
calls: get_time
get_date
pmaxmin
drymadj
zsmean
vort_d
pv_entropy
get_pressure_given_height
get_height_given_pressure
pmaxming
age_of_air
is called by: atmosphere_up
get_pressure_given_height in file tools/fv_diagnostics.F90:578: (openmp leaf)
array args: wz, a2 (local to caller)
ts [pt(1,beglat,nlev) in calls; use-associated]
is called by: fv_diag
get_height_given_pressure in file tools/fv_diagnostics.F90:645: (openmp leaf)
array args: wz, a2 (local to caller)
is called by: fv_diag
age_of_air in file tools/fv_diagnostics.F90:710:
array args: delp, peln, q (use-associated)
calls: pmaxmin
is called by: fv_diag
Note: age_of_air in file tools/age_of_air.F90:1: is not used!
pmaxming in file tools/pmaxmin.F90:4:
array args: a [varies by caller, all use-associated]
calls: pmaxmin
is called by: fv_diag [a -> q (use-associated)]
hydro_eq [a -> pt (use-associated)]
read_fv_rst [a -> u, v, pt, q, u_srf, v_srf (use-associated)]
pmaxmin in file tools/pmaxmin.F90:30: (openmp leaf)
array args: a [varies by caller]
is called by: fv_diag [a -> ps, ua, va, omga (use-associated), a2 (local)]
age_of_air [a -> age (local)]
pmaxming [a -> tmp (local)]
read_fv_rst [a -> zsurf (local)]
write_fv_rst [a -> ps(1,beglat) (use-associated)]
vort_d in file tools/pv_module.F90:16: (openmp leaf)
array args: u, v [use-associated]
vort [local or use-associated in caller - see below]
is called by: fv_diag [use-associated omga as vort in call]
init_sw_ic [local vort in call]
pv_entropy in file tools/pv_module.F90:97:
array args: pt, pkz, delp [use-associated]
vort [use-associated, actual arg is omga]
calls: ppme
is called by: fv_diag
fv_dynamics in file model/fv_dynamics.F90:39:
array args: u, v, delp, pt, q, ps, pe, pk, pkz, phis, omga, peln, ua, va [use-associated]
calls: benergy
p_energy
pft2d
cd_core
tracer_2d
tracer_2d_lima
geo_map
te_map
compute_vdot_gradp
is called by: atmosphere_down
benergy in file model/mapz_module.F90:2193: (openmp leaf)
array args: u, v, pt, delp, q, pe, peln, phis [use-associated]
tte [local to caller]
te, dz [work, actual args use-associated: ua, va (resp.)]
is called by: fv_dynamics
te_map in file model/mapz_module.F90:26:
array args: pk, q, delp, pe, ps, u, v, pt, ua, va, omga, peln, pkz [use-associated]
pem [local to caller]
calls: pkez
map1_ppm
map1_q2
par_vecsum
d2a3d
is called by: fv_dynamics
pkez in file model/mapz_module.F90:2383: (openmp leaf)
array args: pe, pk, pkz [use-associated]
is called by: te_map
geo_map in file model/mapz_module.F90:830:
array args: u, v, pt, delp, q, pe, pk, ps, omga, peln, pkz, phis [use-associated]
ua, va [use-associated, work]
calls: mapn_ppm
map1_ppm
p_energy
par_vecsum
d2a3d
is called by: fv_dynamics [phis(1,jfirst) in call]
p_energy in file model/mapz_module.F90:1298:
array args: v, pt, delp, q, pe, peln, phis, ua, va [use-associated]
calls: d2a3d
is called by: fv_dynamics [phis(1,jfirst) in call]
geo_map [phis in call -> phis(1,jfirst) in call from fv_dynamics]
compute_vdot_gradp in file model/fv_dynamics.F90:568:
array args: cx, cy [local to caller]
pe, omga [use-associated]
calls: pft2d
is called by: fv_dynamics
tracer_2d_lima in file model/tracer_2d.F90:10:
array args: q [use-associated]
cx, mfx, cy, mfy [local to caller]
dp1 [actual arg use-associated: va]
flx, va [work arrays ("useless output");
actual args use-associated: ua, pkz (resp.)]
calls: split_trac
tp2c
is called by: fv_dynamics
cd_core in file model/dyn_core.F90:32:
array args: dx, rdx, [both 1d]
u, v, pt, delp, delpf, pe, pk [delpf not in fv_array.h]
uc, vc, delpc, ptc, dpt, wz3, pkc, wz ["useless output"]
calls: get_eta_level
c_sw
geopk <- leaf, openmp (in file model/dyn_core.F90:917:)
pft2d
upol5
prt_maxmin_local
d_sw
geop_d <- leaf, openmp (in file model/dyn_core.F90:795)
is called by: fv_dynamics
geop_d in file model/dyn_core.F90:795: (openmp leaf)
array args: pe, delp, pk, wz, hs, pt
[pk, wz: work arrays not in fv_array.h]
is called by: cd_core
actual args: pe, delp, pkc, wz, hs(1,jfirst), pt
geopk in file model/dyn_core.F90:917: (openmp leaf)
array args: pe, delp, pk, wz, hs, pt
[delp, pk, wz, pt: work arrays not in fv_array.h]
is called by: cd_core
actual args: pe, delpc, pkc, wz, hs(1,jfirst), ptc
update_fv_phys in file model/update_fv_phys.F90:1:
array args: u_dt,v_dt,t_dt,q_dt
[allocated in fv_physics_down; driver/coupled/fv_physics.f90]
calls: pft2d_phys [called separately with u_dt, v_dt and t_dt]
polavg [called separately with t_dt and q_dt]
get_atmos_nudge [called with all arr args]
is called by: fv_physics_up [driver/coupled/fv_physics.f90]
d2a3d in file model/fv_pack.F90:1602: (openmp leaf)
is called by: fv_restart
te_map
geo_map
p_energy
init_dry_atm (km=1!)
read_fv_rst (km=1!)
p_var in file model/fv_pack.F90:1719:
array args: delp,ps,pe,peln,pk,pkz,q [all use-associated - ok]
calls: drymadj
is called by: init_dry_atm [no array args]
read_fv_rst [no array args]
drymadj in file model/fv_pack.F90:1839: (openmp leaf)
array args: ps,delp,pe,pk,peln,pkz,q [all use-associated - ok]
is called by: p_var
fv_diag
maxmin_global in file tools/write_fv_rst.F90:265:
is called by: write_fv_rst
[first 4 places inside #ifdef SPMD;
5th outside, omp directive in maxmin_global
probably can be eliminated and call
parameters adjusted]
Wed, 28 Sep 2005
data: IPCC AR5
Points for Ron/Karl's talk at WGCM:
Several technologies are already deployed, or will be ready to deploy,
in time for IPCC/AR5 (i.e prototypes ready for test flights on IPCC
AR4 data holdings by around 2008). Much of the energy needs to go to
building consensus around standards.
Data volumes are likely to be too high to be served off a single site
(even if it is all mirrored in one place... we will probably need to
spread the bandwidth).
Greater tolerance of grid diversity is needed: this is already evident
on the ocean side in AR4 and will continue to be so; it is likely to
be manifest in the atmosphere as well in AR5, e.g the cubed-sphere.
the "1 dataset = 1 file" paradigm will likely have to be broken - it
is unlikely that data formats and web servers where the bulk of data
reside will be able to handle projected dataset sizes gracefully.
Robust and transparent aggregation is a requirement.
It will be possible to describe the differences between model
configurations in greater detail: the CMOR vocabulary for describing
IPCC experiments will be expanded to include model metadata describing
model components (media) and subcomponents (physics options,
algorithms).
Basic server-side analysis capabilities will be possible: aggregation,
regridding, subsetting, perhaps some degree of construction of new
fields (e.g fluxes from mass and velocity fields).
Required operations in this environment:
Certification: will be handled via a metadata database, whose
canonical version will reside at PCMDI, but which might be mirrored
elsewhere. PCMDI will certify data quality as being up to snuff by
passing it through their validator, which will verify that
experiments, models and grids are correctly described, and that
required fields are present and correctly organized. The database
contains only the metadata; the actual data may continue to reside at
the home institution. The DB will contain checksum information so the
consumer
can verify that the dataset being analyzed is the one certified.
Standardization: A grid metadata standard will be agreed upon by the
modeling and data framework communities, and enshrined in CF soon.
Both client-side (e.g ferret, grads, vcs) and server-side (cdat, fds,
las) tools will learn to display and analyze data using the new grid
descriptors. In particular, with such a standard, regridding for the
purpose of analysis is likely to become a more routine operation than
it is now. This will give consumers the option of analyzing data on
the native grid, or, perhaps with a more limited palette of options,
on a "standard" grid.
Also needed: model metadata for searching and understanding
differences in model configurations, given the greater incidence of
shared components compared to AR4.
Server-side data processing: at the minimum, subsetting and some
regridding capability will be performed on the servers where data
reside before delivery to the consumer. The data producer may supply
custom regridding software adapted to the native grid, and will take
responsibility for testing and QA of on-the-fly regridding. Some of
this is compute-intensive and may require deferred processing and data
delivery using web tokens (e.g "batch LAS").
So the bullets:
A distributed data archive: the data are dispersed, but can be
searched and located through a relational database ("Curator")
containing the metadata. The metadata are centrally held and certified
as meeting AR5 requirements by PCMDI.
Extended metadata standards: via CF conventions, also centred at
PCMDI. Metadata standards will allow diverse native grids, at the
minimum, displaced-pole, tripolar, cubed-sphere. Description of
experiments and scenarios will be extended to more precise
descriptions of model configurations, components and subcomponents,
also via CF model metadata.
Server-side data processing: aggregation, regridding, subsetting,
including "batch" web services for computationally intensive analysis.
<!-- FV shared pointers -->
The implementation of shared pointers is ready to be deployed: see
fv_arrays.F90 and fv_arrays.h for the implementation.
Deployment will take two flavours:
- routines where shared arrays are use-associated
- routines where shared arrays are passed by arguments: in this
case, one needs to track back up the calling tree to find calling
instances and get them right.
Calling tree (created by \$HOME/src/perl/ftree : OMP-parallel
routines in bold):
{include "calling_tree" nil t}
Tue, 27 Sep 2005
FMS: FV core shared pointers will be Cray!
The implementation of shared pointers is ready to be deployed: see
fv_arrays.F90 and fv_arrays.h for the implementation.
Deployment will take two flavours:
- routines where shared arrays are use-associated
- routines where shared arrays are passed by arguments: in this
case, one needs to track back up the calling tree to find calling
instances and get them right (see the sketch below).
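A sketch of the two flavours (the module and array names below are
illustrative, following fv_arrays.F90 , not the exact interface):
subroutine diag_use_assoc()
  !flavour 1: the shared array is use-associated, so the routine can
  !be converted in place without touching its callers
  use fv_arrays, only: omga       !shared pointee set up in fv_arrays.F90
  omga(:,:,:) = 0.                !every PE addresses the same storage
end subroutine diag_use_assoc

subroutine diag_by_arg( a )
  !flavour 2: the shared array arrives as a dummy argument: whether a
  !is shared depends on the call site, so each caller in the tree must
  !be checked (this is what the annotations in the calling tree record)
  real, intent(inout) :: a(:,:,:) !may be shared (e.g omga) or caller-local
  a(:,:,:) = 0.
end subroutine diag_by_arg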
Calling tree (created by \$HOME/src/perl/ftree : OMP-parallel
routines in bold):
{include "full_calling_tree" nil t}
Mon, 8 Aug 2005
Curator talk?
We (me and who? Steve H? Cecelia etc.?) should probably present
something on the <a
title="Curator and Grid standards talk at MIT, July 2005"
href="{url}curator_gridmeta2005.pdf">Curator</a> at <a
title="IN15: Multidisciplinary Global Modeling: The Really Big Picture"
href="http://www.agu.org/meetings/fm05/?pageRequest=search&show=detail&sessid=413">
AGU</a>. Abstracts are <a href="http://submissions4.agu.org/submission/entrance.asp">
due 8 September 2005</a>, and one must join the AGU by 15 August
2005.
Thu, 4 Aug 2005
FMS: FV core shared pointers on stack and heap?
The lima_vb branch has been updated with tests for two
methods for creating shared memory arrays that are remotely
addressable. Both require the use of the module mpt-1.12 and
the CPP macros -Duse_libMPI -Duse_SGI_GSM .
use_SGI_GSM requires MPI: I will modify
fms_platform.h so that this is automatic.
For stack and automatic arrays, we use the GSM_Alloc (get
techpubs/intel reference) call. This has now been bound to our
mpp_malloc call. Here is an example of how to create a remotely
addressable automatic array:
subroutine sub(n)
  integer, intent(in) :: n
  real :: auto(n)          !an ordinary automatic array by default...
#ifdef use_SGI_GSM
  pointer( p, auto )       !...but a pointee when use_SGI_GSM is set
  integer, save :: len=0   !mpp_malloc re-allocates only when n > len
  call mpp_malloc( p, n, len )
#endif
  !... use auto(1:n) as usual ...
end subroutine sub
Now auto is an automatic array shared between all the PEs in
the current_pelist .
mpp_malloc relies on the Cray pointer p
automatically having the save attribute. This was true on
Cray/Origin; can we verify that it is also true on Altix?
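If there is any doubt, the attribute can be made explicit rather than
relied upon; a minimal sketch (not current mpp code, and assuming the
compiler permits save on a Cray pointer):
subroutine sub(n)
  integer, intent(in) :: n
  real :: auto(n)
  pointer( p, auto )
  save p                 !make the save attribute explicit instead of
                         !relying on the Cray/Origin default behaviour
  integer, save :: len=0
  call mpp_malloc( p, n, len )
end subroutine sub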
For allocatable arrays, we use the MPI_SGI_GlobalPtr (get
techpubs/intel reference) mechanism. This requires allocating on one
of the PEs, and sending the address to the other PEs. On the Origin,
it was sufficient to set the environment variable
SMA_GLOBAL_ALLOC , which ensured that the numerical value of the
address was the same everywhere: the allocating PE merely needed to
broadcast the address.
On Altix, the address as seen by a remote PE does not have the same
numerical value. SGI added a call for us, MPI_SGI_GlobalPtr , which
translates the remote address to one that is valid from the calling
PE. This has been bound to two new calls: mpp_send_ptr and
mpp_recv_ptr . Here is an example of using it:
if( pe.EQ.root )then
    allocate( a(n) )
    l = loc(a)                   !address valid on the allocating PE
    do i = 1,npes-1
       call mpp_send_ptr( l, mod(root+i,npes) )
    end do
else
    call mpp_recv_ptr( l, root ) !receives the translated address
end if
call sub(l) !instead of call sub(a), or make sub argument-less

subroutine sub(p)
  real :: a(n)
  pointer( p, a )                !p points a(:) at the shared allocation
end subroutine sub
Now all PEs in the current_pelist are looking at the same
array a(:) .
In writing subroutine sub you have to be <b>very careful</b>
to assign exclusive portions of a(:) to different PEs. You'll
have race condition errors if you don't. However, any existing OMP
code is guaranteed to have eliminated such race conditions.
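For instance, a minimal sketch of the kind of explicit decomposition
required (the pe / npes bookkeeping here is illustrative, not the FMS
API):
  integer :: i, is, ie, chunk
  chunk = (n + npes - 1)/npes    !ceiling(n/npes)
  is = pe*chunk + 1              !pe assumed to run from 0 to npes-1
  ie = min( n, (pe+1)*chunk )
  do i = is,ie
     a(i) = 0.                   !no two PEs write the same element
  end do
  call mpp_sync()                !barrier before anyone reads a(:)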
The checkout for this experiment (based on m45_am2p13 in
am2.xml ):
<cvs>
<codeBase>fms_fv_am2</codeBase>
<modelConfig>lima</modelConfig>
<cvsUpdates>
cvs update -r lima_vb atmos_fv_dynamics shared/mpp
</cvsUpdates>
</cvs>
<compile>
<csh>
cp /home/vb/fms/lima_vb/rts/ia64/m45_am2p13_lima_vb/src/path_names \
\$code_dir
</csh>
<cppDefs>-DUSE_LIMA -DSPMD -Duse_libMPI -Duse_netCDF</cppDefs>
</compile>
The correct versions are obtained by setting the following compilation
target:
<target platform="ia64">
<csh>
source /opt/modules/default/init/tcsh
module purge
module load modules icc.8.1.026 ifort.8.1.023 mpt-1.12 \
scsl-1.5.1.0 idb.7.3.2
setenv MALLOC_CHECK_ 0
</csh>
</target>
Next step is to turn on the -Duse_SGI_GSM flag. Working on that now.
Wed, 3 Aug 2005
FMS: FV core shared pointers, cpp flag?
#ifdef use_GSM is used by Jeff for the changes to do direct
copies into halos.
#ifdef use_MPI_GSM is used by Gerardo for the changes to do
mpp_malloc -like stuff... should that be folded in?
Ask Gerardo about the use of sizeof() : does the Fortran binding
return its answer in bytes? does it work on types?
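A quick check one could compile and run (a sketch: sizeof() here is
the vendor extension, not a standard intrinsic):
  real(4) :: r4
  real(8) :: r8(10)
  print *, sizeof(r4)   !4 if the answer is in bytes
  print *, sizeof(r8)   !80 if it accepts arrays and counts bytes
  !whether sizeof(real) - a type name rather than a variable - is
  !accepted at all is part of the question for Gerardo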
So the steps to use on Altix are
<blockquote>
allocate on 1 PE<br>
share the pointer... does this mean everyone else but root_pe
has to declare differently?
</blockquote>
Mon, 1 Aug 2005
FMS: FV shared pointers
Create m45_am2p13_prelima by inheritance from m45_am2p13 .
Link the src : we want to use the same
~/fms/lima_vb/rts/ia64/m45_am2p13/src files, as well as
the path_names file with special compilation for
atmos_fv_dynamics/model/tp_core.f90 .
Added lima_vb tag to
atmos_fv_dynamics/driver/coupled/atmos_nudge.f90
Also added lima_vb tag to
atmos_fv_dynamics/model/shr_kind_mod.f90 . Koushik had created
it in atmos_fv_dynamics/tools : that file no longer has the
tag.
The XML to make it is now
<cvs>
<codeBase>fms_fv_am2</codeBase>
<modelConfig>lima</modelConfig>
<cvsUpdates>
cvs update -r lima_vb atmos_fv_dynamics
</cvsUpdates>
</cvs>
<compile>
<csh>
cp /home/vb/fms/lima_vb/rts/ia64/m45_am2p13_lima_vb/src/path_names \
\$code_dir
</csh>
<cppDefs>-DUSE_LIMA -DSPMD -Duse_libMPI -Duse_netCDF</cppDefs>
</compile>
Thu, 28 Jul 2005
FMS: MPP: Mosaic
While getting underway with the updates to MPP, Jeff, Zhi and I are
also looking at a major code overhaul, as well as incorporation of
features requested earlier but never implemented for one reason or
another (usually sound ones:-).
Here's a list of possible features to incorporate into MPP (in no
particular order):
- making pelist opaque: currently it is a simple integer array of PE
IDs, and we test internally (in get_peset ) whether this list, when
ordered, corresponds to an existing communicator. This is a
potential performance problem. The proposed change will require the
use of the existing mpp_declare_pelist routine to create
pelist s before use; perhaps the optional on-the-fly
pelist argument available in many routines will be suppressed (see
the sketch after this list).
- remove restriction that arrays passed into
mpp_domains methods be
declared on the data domain; instead this could be on a domain
larger than the data domain. The domain2D datatype now understands
three classes of subdomains.
- support partial-width halo update
- more compact code, removal of nested
#include s, and
perhaps elimination of the _old versions of stuff. This requires
proof that the _new version performs at least as well, in a range
of test cases, unit tests as well as system tests, over a range of
resolutions and scaling (PE counts).
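A sketch of the proposed pelist discipline for the first item
( mpp_declare_pelist and mpp_set_current_pelist are existing mpp
calls; the point is that the communicator is created once, up front,
rather than on the fly):
  integer :: i
  integer :: pelist(npes/2)
  pelist = (/ (i, i=0,npes/2-1) /)      !e.g the first half of the PEs
  call mpp_declare_pelist( pelist )     !create the communicator once
  call mpp_set_current_pelist( pelist ) !subsequent mpp calls use it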
For Mosaic I am proposing a new definition interface,
mpp_define_mosaic . It will reuse as much as possible the
current, single-tile version of the software, called
mpp_define_domains .
It is done in two stages:
do n = 1,ntiles
   !stage 1: define each tile's own domain decomposition
   call mpp_define_domains( ... tile=n, pelist=pelist(n) )
end do
!stage 2: declare how tile edges are in contact with one another
call mpp_define_contact_point( tile1, tile2, &
     (/i1,j1/), (/i2,j2/), align='X' )
Provide examples of this for cubesphere, tripolar with horizontal and
vertical division.
Also, the cyclic and folded cases, which we currently treat as
keywords, can become contact_points instead!
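For example, a doubly-periodic single-tile domain might then be
declared as follows (a sketch against the proposed interface:
mpp_define_contact_point does not exist yet, and the argument
conventions are assumptions):
call mpp_define_domains( (/1,ni,1,nj/), layout, domain, tile=1, pelist=pelist(1) )
!x-cyclic: the east edge of tile 1 is in contact with its own west edge
call mpp_define_contact_point( 1, 1, (/ni,1/), (/1,1/), align='X' )
!y-cyclic: the north edge is in contact with its own south edge
call mpp_define_contact_point( 1, 1, (/1,nj/), (/1,1/), align='Y' )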