next up previous contents
Next: 12.5.1 Calculating row boundaries Up: 12. Multi-tasking Previous: 12.4 The distributed memory

   
12.5 Domain Decomposition

If all arrays were globally dimensioned12.4 on all processors, available memory could easily be exceeded with even modest sized problems. Each processor should only dimension arrays large enough to cover the portion of the domain being worked on by the specific processor. Refer to Fig 12.1a which is an example of a domain with i=1,imt longitides and jrow=1,jmtlatitudes divided among 9 processors arranged in a 2d (two dimensional) domain decomposition. Fig 12.1b gives the arrangement for a 1d (one dimensional) domain decomposition in latitude using the same 9 processors. In both cases, two dimensional arrays need only be dimensioned large enough to cover the area worked on by each processor. As discussed below, this area will actually be one or two cells larger at the processor boundaries to include boundary cells required by the numerics. Boundary cells on each processor must be updated with predicted values from the domain of the adjoining processor. The arrows indicate places where communication takes place across boundaries between processors.

Figure 12.1c indicates three cases: one for 9 processors, 100 processors, and 900 processors. Within each case, three domain shapes are considered: imt=jmt, imt=2*jmt, and imt=jmt/2. The tables indicate how many communication calls are required. Since both processors which share a boundary require data from the other processor, two communication calls are required at each boundary and the same amount of words are transferred across the boundary for each processor. Two things are worth noting. First, as the number of processors increases, the number of words transferred in the 2d domain decomposition is much less than in the 1d domain decomposition. Second, for large numbers of processors, 1d domain decomposition requires half as many communication calls as 2d domain decomposition. An additional problem for 2d domain decomposition which is not indicated in Figure 12.1c is that the equations are not symmetric with respect to latitude and longitude. Specifically, a wrap around or cyclic condition is placed on longitudes for global simulations. No such condition is needed along latitudes. The implication is that many additional communication calls will be needed to impose a cyclic condition on intermediate computations for 2d domain decomposition. These communication calls are not needed in a 1d decomposition in latitude because each latitude strip including the cyclic boundary is on processor. To lessen the need for extra communication calls in 2d decomposition, additional points can be added near the cyclic boundary thereby extending the domain slightly.

Whether 1d or 2d decomposition is better depends on the the latency involved in issuing a communication call versus the time taken to transfer the data. Another factor is whether polar filtering is used because it would also require extra communications in a 2d decomposition. At this point, the 1d domain decomposition of Fig 12.1a is what is done in MOM.

Executing the 3 degree global test case #0 for MOM on multiple processors of a T3E using 1-D decomposition in latitude indicates that scaling falls off quickly when there are fewer than 8 latitude rows per processors. This result has dire implications for climate models. For instance, a 2 degree global ocean model will have about 90 latitiude rows. Efficient scaling implies only 11 processors can be used. On the GFDL T3E, it takes about 10 processors to equal the speed of one T90 processor. If it is assumed that processors on the next system are 2.5 times as fast as the T3E processors and the ratio of network speed to processor speed stays the same, a two degree climate model will only gain a factor of about 3 in speed over a T90 processor. If the processor speed increases faster than the network speed, then the situation gets worse. The implication is that 2-D domain decomposition will be needed for climate modeling if an order of magitude speed up is to be attained..

On the other hand, a 1/5 degree global ocean model with 50 levels has 900 latitude rows and requires 2 gigawords of storage with a fully opened memory window. Using 9 rows per processor implies that 100 processors can be used with 1-D decomposition. Each processor must have at least 20 megawords. Realistically, by the next procurement, each processor will have at least 32 megawords of memory. If the GFDL T3E had enough processors, then a speedup of 25 times that of one T90 processor would be realized. It seems that 2-D decomposition is more important for coarse resolution models.

Since the majority of simulations with MOM have been carried out on vector machines at GFDL, the idea has been to compute values over land rather than to incur the cost of starting extra vectors to skip around land cells. In an earlier implementation, extra logic was in place to skip computations on land cells. It turned out that the speed increase on vector machines was not worth the extra coding complexity so the logic was removed. On scalar machines, it may be beneficial to skip computation over land areas. A 2d domain decomposition would in principle more easily allow computation over land to be skipped by not assigning processors to areas containing all land cells. In a 1d decomposition, extra logic would need to be added to eliminate computation over land cells.



 
next up previous contents
Next: 12.5.1 Calculating row boundaries Up: 12. Multi-tasking Previous: 12.4 The distributed memory
RC Pacanowski and SM Griffies, GFDL, Jan 2000