Coding for performance

As high performance is a vital consideration, we provide guidelines for coding that may assist in writing efficient code. These need not be followed in routines that are not critical to the overall performance of the code.

Memory management

The bulk of the memory is taken up by 3D fields, so module fields must use memory efficiently. This section deals with module internal fields and work arrays.

There are several performance considerations to keep in mind in designing a memory management strategy.

Calls to allocate and deallocate memory from the heap can be expensive, as they often require system calls.^3.1 Allocation and deallocation on the stack (as for automatic arrays) can be fast, but a request larger than the available stack generally overflows to the heap.
Putting all arrays in static memory may inflate memory usage beyond practical limits. It also conflicts with the requirement of runtime configurability.

In light of these, two strategies suggest themselves. One is for modules to allocate all or most of the required memory at initialization. Workspace can be managed through the use of simple user stacks, which are initialized by the module constructor (FMS:Init) and reused. Examples of user stacks are available in the modules in the MPP package, as described in MPP. In particular, we use very simple stack management, where there is no persistent storage in the stack: each call can use all of the stack, and all of the stack is considered to be released on exit from the call.
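The user-stack idea can be illustrated with a minimal sketch. This is not the MPP implementation: the module user_stack_mod, its routines stack_init, stack_push and stack_release, and the routine double_in_place are hypothetical names chosen for illustration. The sketch assumes a single real-valued stack allocated once, with work arrays handed to routines as array sections so that no heap allocation occurs after initialization. It uses error stop, which is a later addition to the language than f90.

```fortran
! Hypothetical sketch of a simple user stack; not the MPP code.
module user_stack_mod
  implicit none
  private
  public :: stack, stack_init, stack_push, stack_release

  real, allocatable :: stack(:)   ! workspace, allocated once at initialization
  integer :: stack_top = 0        ! number of words currently handed out

contains

  ! Called once from the module constructor: the only heap allocation.
  subroutine stack_init(nwords)
    integer, intent(in) :: nwords
    if (.not. allocated(stack)) allocate(stack(nwords))
    stack_top = 0
  end subroutine stack_init

  ! Reserve n words and return the index of the first one.
  ! stack_init must have been called beforehand.
  function stack_push(n) result(first)
    integer, intent(in) :: n
    integer :: first
    if (stack_top + n > size(stack)) error stop 'user stack overflow'
    first = stack_top + 1
    stack_top = stack_top + n
  end function stack_push

  ! Release the whole stack on exit from a call: nothing persists.
  subroutine stack_release()
    stack_top = 0
  end subroutine stack_release

end module user_stack_mod

program stack_demo
  use user_stack_mod
  implicit none
  integer, parameter :: n = 8
  real :: field(n)
  integer :: i, first

  call stack_init(1000)                  ! done once, at initialization
  field = (/ (real(i), i = 1, n) /)

  first = stack_push(n)                  ! reserve n words of workspace
  call double_in_place(n, field, stack(first:first+n-1))
  call stack_release()                   ! the whole stack is free again

  if (any(field /= (/ (2.0*real(i), i = 1, n) /))) error stop 'wrong result'
  if (stack_push(1000) /= 1) error stop 'stack was not fully released'
  print *, 'field =', field

contains

  ! The work array arrives as an argument and lives on the user stack.
  subroutine double_in_place(n, f, work)
    integer, intent(in) :: n
    real, intent(inout) :: f(n)
    real, intent(inout) :: work(n)
    work = f
    f = f + work
  end subroutine double_in_place

end program stack_demo
```

Passing a contiguous section of the stack to an explicit-shape dummy argument avoids both heap calls and f90 pointers. The cost is that the caller must size the stack generously at initialization.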

Thread Safety

Scalable architectures can be divided into two broad classes: distributed memory, where processors each have independent memory hardware, and shared memory, where many processors have read and write access to the same physical memory. We can think of MPI and OpenMP as the canonical programming paradigms for these two architectural types. Increasingly, a hybrid architecture is becoming a basic form, usually called the "cluster of SMPs", where a group of shared memory nodes operates as a distributed memory cluster.

A basic distributed memory programming model usually can target all these types of architectures, and that is the approach followed in FMS. We have not chosen to recast our models in a hybrid programming paradigm nesting the MPI and OpenMP approaches, choosing instead to limit software complexity.

Instead, the design choice in FMS is to have clearly demarcated regions of code where distributed memory parallelism and shared memory parallelism are invoked. The basic parallel construct is the horizontal domain decomposition outlined above in FMS:Parallelism. Inside each of these parallel regions, we may invoke shared memory parallelism in regions of code which are known to contain no horizontal dependencies, if the underlying architecture is known to deliver significant increases in performance or scalability with a shared memory programming model.

An example where this approach might be followed is in a spectral atmospheric model. The spectral transform method for distributed memory typically reaches its limit of scalable efficiency in a 1D decomposition (longitudinal for the FFTs, latitudinal for the Legendre transforms). The decomposition is longitudinal when data is in grid space. The bulk of the computation is in grid space and is generally taken up in column physics routines (FMS:ColumnPhysics). Since these routines have no horizontal data dependencies, it is possible to parallelize further in this region of code using shared memory parallelism.

This brings us to the issue of "thread safety". This is the somewhat imprecise term used to describe the organization of memory that allows multiple execution threads to use a shared region of memory, typically through OpenMP or equivalent directives. The key issue is to distinguish memory addresses as being private to a thread or shared across threads (which users of earlier Cray parallel vector syntax may remember as task common and global common). Thread safety is the careful sorting of variables into thread-private and thread-shared, and careful control of how shared variables are updated. It is best in general to avoid updating shared variables; if it must be done, the update must take place in critical regions, where multiple threads cannot create a race condition.

The thread safety considerations proposed for FMS include:

Only routines with no horizontal dependencies (e.g. column physics) are permitted to have shared memory threads. Typically, use is restricted to the column physics routines described in FMS:ColumnPhysics.
Global storage in these routines is never updated by a shared memory thread: any variable that must be updated by one of these routines must be passed through an argument list.
No I/O is performed from shared memory threads (beyond simple notes to stdout and error messages).

More detailed thread-safety guidelines are provided in FMS:ColumnPhysics.
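The considerations listed above can be sketched with a small OpenMP example. The names column_physics_mod and column_adjust are hypothetical, and the (k,i,j) index order is an assumption made here only so that each column is contiguous in memory; it is not a statement about FMS array layout. Each thread updates only the columns passed to it through the argument list, no module-level storage is written, and no I/O is done inside the threaded loop. Without OpenMP enabled the directives are treated as comments and the program runs serially with the same result.

```fortran
! Hypothetical sketch of a thread-safe column routine; not FMS code.
module column_physics_mod
  implicit none
contains

  ! All data that is updated arrives through the argument list.
  subroutine column_adjust(nk, t, dt)
    integer, intent(in) :: nk
    real, intent(inout) :: t(nk)
    real, intent(in) :: dt
    real :: tend(nk)        ! automatic array: private to the calling thread
    integer :: k

    do k = 1, nk
      tend(k) = -0.5*t(k)   ! purely vertical: no horizontal dependency
    end do
    t = t + dt*tend
  end subroutine column_adjust

end module column_physics_mod

program thread_demo
  use column_physics_mod
  implicit none
  integer, parameter :: ni = 16, nj = 8, nk = 10
  real :: temp(nk, ni, nj)
  integer :: i, j

  temp = 4.0

  ! Each (i,j) column is touched by exactly one thread, so the shared
  ! array is never updated at the same address by two threads.
  !$omp parallel do private(i, j) shared(temp)
  do j = 1, nj
    do i = 1, ni
      call column_adjust(nk, temp(:, i, j), 1.0)
    end do
  end do
  !$omp end parallel do

  ! 4.0 + 1.0*(-0.5*4.0) = 2.0 exactly in floating point
  if (any(temp /= 2.0)) error stop 'unexpected column result'
  print *, 'all columns updated to', temp(1, 1, 1)   ! I/O outside the threads
end program thread_demo
```

A local variable declared with the save attribute, or a module variable written inside column_adjust, would be shared across threads and would break this pattern.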

Pointers

The use of pointers in f90 is a subject of much debate. In general, the use of f90 pointers may be detrimental to performance, as it inhibits optimization. However, the standard itself requires dynamic arrays encapsulated within derived types to have the pointer attribute. This is now widely recognized within the community to have been an error in the standard, and future revisions of the standard will permit such arrays to have an allocatable attribute. This is in fact already available as an extension in many compilers, but not widely enough to be usable here.
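The two forms of dynamic array component can be compared in a short sketch. The type names field_ptr_type and field_alloc_type and the component name values are hypothetical. The allocatable form shown second is the extension referred to above; it was later standardized (ISO/IEC TR 15581, then Fortran 2003), so the example assumes a compiler that supports it. The default initialization with null() is a Fortran 95 feature.

```fortran
! Hypothetical sketch contrasting the two component declarations.
module field_types_mod
  implicit none

  ! f90/f95 form: a dynamic array inside a derived type must be a pointer.
  type field_ptr_type
    real, pointer :: values(:,:) => null()
  end type field_ptr_type

  ! Extension form: the same component declared allocatable.
  type field_alloc_type
    real, allocatable :: values(:,:)
  end type field_alloc_type

end module field_types_mod

program component_demo
  use field_types_mod
  implicit none
  type(field_ptr_type) :: fp
  type(field_alloc_type) :: fa

  allocate(fp%values(4, 3))
  allocate(fa%values(4, 3))
  fp%values = 1.0
  fa%values = 1.0

  if (sum(fp%values) /= sum(fa%values)) error stop 'components differ'
  print *, 'associated:', associated(fp%values), ' allocated:', allocated(fa%values)

  ! A pointer component must be freed explicitly or its memory leaks.
  deallocate(fp%values)
  ! Explicit deallocation of the allocatable component, for symmetry.
  deallocate(fa%values)
end program component_demo
```

Both forms are used identically once allocated. The difference is in what the compiler may assume: a pointer component may be associated with storage that other names also reach, whereas an allocatable component owns its storage.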

FMS is also required to be parsed by automatic source-to-source differentiation tools to generate adjoint and tangent linear models for data assimilation. It has been found that the use of f90 pointers places insuperable demands on automatic differentiation. However, many of these can be overcome by placing a restriction that f90 pointers be static, i.e., once assigned, they will never be redirected. FMS adopts this restriction for code segments subject to automatic differentiation. For code segments violating this restriction, the developer is required to provide adjoint code so that automatic differentiation for that code segment may be avoided. The adjoint requirement is not currently in force, but will be shortly.
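A minimal sketch of what a static pointer looks like in practice follows. The variable names are hypothetical, and the time-stepping loop is only an illustration of a place where redirecting a pointer (for example, swapping time levels) might otherwise be tempting.

```fortran
! Hypothetical sketch of a static f90 pointer: assigned once, never redirected.
program static_pointer_demo
  implicit none
  real, target :: t_old(5), t_new(5)
  real, pointer :: t(:)
  integer :: step

  t_old = 1.0
  t_new = 0.0

  t => t_new             ! the only pointer assignment in the program

  do step = 1, 3
    t = t_old + 1.0      ! ordinary assignment: updates t_new through t
    t_old = t            ! copy the data rather than swap the pointer
    ! Redirecting here, e.g. "t => t_old", would violate the restriction.
  end do

  ! t_new takes the values 2.0, 3.0, 4.0 over the three steps
  if (any(t /= 4.0)) error stop 'unexpected value'
  print *, 't =', t
end program static_pointer_demo
```

The copy in the loop costs a memory-to-memory transfer that a pointer swap would avoid; that is the trade-off accepted in return for a target that is fixed throughout the code.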

Another style of pointer of much utility is the Cray pointer. This is outside the f90 standard, but is universally available on compilers designed for high performance, including all the major scalable and vector system native compilers, as well as several compilers designed for HPC on Linux clusters. Its utility is in avoiding memory-to-memory copies, and in writing interfaces interoperable with C. Cray pointers are used in FMS in performance-critical low-level utilities such as the communication kernels (MPP).

Footnotes

^3.1 If a process has once allocated memory to itself and then deallocated it, that portion of memory can generally be reused without a system call to assign another memory arena. This optimization is, however, not guaranteed on all platforms, and is only useful if subsequent allocations fit within the original one.


Author: V. Balaji
Document last modified