## STATISTICS & DATA ANALYSIS TECHNIQUES

**Introduction**

A scientist must call upon different techniques in order to carry out research. One of the main tools used at GFDL is the physical (mathematical) model, for example a GCM, which simulates the physical system using the basic laws of physics. Another approach, which is of particular interest to me, is to use statistical tools, based on a different set of laws, to examine data. While meteorologists are in general trained in basic traditional statistics, there is a wealth of statistical knowledge which the meteorological community has yet to tap. I try to keep abreast of developments in the field of statistics (hindered by my limited background) in the hope that I will stumble upon something useful.

**Eigenvector Analysis**

One of my specialties is the use of eigenvector analysis techniques, which encompass Principal Components/Empirical Orthogonal Function (EOF) Analysis. In some earlier projects which were predictive in nature (Harnack et al., 1982a; Harnack and Lanzante, 1984; Harnack and Lanzante, 1985; Harnack et al., 1986a/b/c) EOFs were used to extract the major signals in oceanic and atmospheric fields. These techniques are also quite useful in diagnostic studies (Harnack et al., 1982a; Lanzante 1984; Lanzante and Harnack, 1984; Lanzante 1990; Lanzante 1991; Lanzante 1996). Over the course of these studies the virtues and nuances of rotation of the eigenvectors have been realized.
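As a minimal sketch of the basic (unrotated) technique, and not the code used in the studies cited above, EOFs can be computed from the singular value decomposition of an anomaly matrix. The function name `eof_analysis`, the (time, space) layout, and the synthetic field are all assumptions made for illustration.

```python
import numpy as np

def eof_analysis(field, n_modes=2):
    """EOFs of a (time, space) data matrix via SVD of the anomalies."""
    anomalies = field - field.mean(axis=0)        # remove the time mean at each point
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    eofs = vt[:n_modes]                           # spatial patterns (orthonormal rows)
    pcs = u[:, :n_modes] * s[:n_modes]            # principal component time series
    explained = s**2 / np.sum(s**2)               # fraction of total variance per mode
    return eofs, pcs, explained[:n_modes]

# Synthetic example (hypothetical data): one standing pattern plus white noise.
rng = np.random.default_rng(0)
n_time, n_space = 200, 30
pattern = np.sin(np.linspace(0.0, np.pi, n_space))
amplitude = rng.standard_normal(n_time)
field = 3.0 * np.outer(amplitude, pattern) + 0.5 * rng.standard_normal((n_time, n_space))

eofs, pcs, explained = eof_analysis(field)
```

With a single strong pattern in the data, the leading EOF should resemble that pattern (up to sign) and account for most of the variance. Rotation of the eigenvectors is a separate step that this sketch does not attempt.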

In my study of the relationships *between* the atmospheric circulation in the western hemisphere and sea surface temperatures in the North Pacific and North Atlantic I employed a variation on EOF analysis which was a forerunner of the SVD (Singular Value Decomposition) approach that has gained popularity in recent years. I have also found *complex* eigenvector analysis to be quite powerful in examining phenomena which propagate or evolve over their lifetime (Lanzante 1990; Lanzante 1991; Lanzante 1996).
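For readers unfamiliar with the later SVD approach, a rough sketch follows. It illustrates the SVD of a cross-covariance matrix between two fields, not the earlier variation used in my study, which is not described here. The function name `coupled_modes` and the synthetic "sst" and "height" fields are assumptions for illustration only.

```python
import numpy as np

def coupled_modes(x, y, n_modes=1):
    """Leading coupled patterns of two (time, space) fields via SVD of their cross-covariance."""
    xa = x - x.mean(axis=0)
    ya = y - y.mean(axis=0)
    cov = xa.T @ ya / (x.shape[0] - 1)            # (space_x, space_y) cross-covariance matrix
    u, s, vt = np.linalg.svd(cov, full_matrices=False)
    scf = s**2 / np.sum(s**2)                     # squared covariance fraction per mode
    return u[:, :n_modes], vt[:n_modes].T, scf[:n_modes]

# Synthetic example (hypothetical data): two fields driven by one shared index.
rng = np.random.default_rng(1)
n_time = 300
index = rng.standard_normal(n_time)
p_sst = np.cos(np.linspace(0.0, np.pi, 20))
p_hgt = np.sin(np.linspace(0.0, np.pi, 15))
sst = np.outer(index, p_sst) + 0.5 * rng.standard_normal((n_time, 20))
hgt = np.outer(index, p_hgt) + 0.5 * rng.standard_normal((n_time, 15))

left, right, scf = coupled_modes(sst, hgt)
```

The left and right singular vectors are the paired spatial patterns, one for each field, whose time series covary most strongly.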

**Resampling & Monte Carlo Simulation**

Screening multiple linear regression was also used extensively in the predictive studies cited above and was the focus of Lanzante (1984a), in which model building and the assessment of skill and significance were addressed. Resampling (jackknife, bootstrap, cross-validation) and other Monte Carlo approaches have been used in many of my publications. It is often the case in the study of climate that the statistical significance of a particular result cannot be assessed using a traditional approach, either because the data are not independent in time and/or space or because it has been necessary to embark on a “statistical fishing expedition” in a small sample of data. These Monte Carlo strategies can help by substituting raw computing power for certain statistical assumptions. Of course one must give careful thought to the design of these Monte Carlo schemes; each problem may require a somewhat different approach.
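One possible scheme, shown here only as a sketch and not as the design used in any particular publication, is a moving-block bootstrap test of a correlation between two serially correlated series. Resampling one series in blocks breaks its relationship with the other while retaining much of its autocorrelation. The function names, the block length of 10, and the synthetic AR(1) series are all assumptions for illustration.

```python
import random
import statistics

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def block_resample(series, block_len, rng):
    """Concatenate randomly chosen contiguous blocks until the original length is reached."""
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(series[start:start + block_len])
    return out[:n]

def block_bootstrap_pvalue(x, y, block_len=10, n_trials=1000, seed=0):
    """Two-sided Monte Carlo p-value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(correlation(x, y))
    exceed = 0
    for _ in range(n_trials):
        y_star = block_resample(y, block_len, rng)   # destroys the pairing with x
        if abs(correlation(x, y_star)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_trials + 1)

def ar1(n, phi, rng):
    """Synthetic first-order autoregressive series (hypothetical test data)."""
    value, out = 0.0, []
    for _ in range(n):
        value = phi * value + rng.gauss(0.0, 1.0)
        out.append(value)
    return out

rng = random.Random(42)
x = ar1(200, 0.8, rng)
noise = ar1(200, 0.8, rng)
y_related = [a + 0.5 * b for a, b in zip(x, noise)]  # shares most of its variance with x
p_related = block_bootstrap_pvalue(x, y_related)
```

The block length controls how much serial correlation survives the resampling, so it needs to be chosen with the persistence of the data in mind.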

**Resistant, Robust & Nonparametric Techniques**

During the 1990s I began grappling with the issue of quality control (of radiosonde data) and with the general problem of the analysis of “messy data”. By this I mean data which are contaminated by outliers or are not Normally distributed. The problem is that these “defects” in the data can render invalid most of the common statistical techniques which meteorologists regularly employ. As it turns out, statisticians have alternative techniques which are not much affected by these defects. In a review article which I have written on this subject (Lanzante 1996) I present some such alternatives to the mean, standard deviation, t-test, correlation, regression, and some other commonly employed statistical measures. Some of these alternatives have been adopted for quality control in NCDC’s Global Historical Climatology Network (GHCN), a collection of monthly land surface observations from several thousand stations. The GHCN is a WMO global baseline data set which has wide use in the study of climate variability. Similarly, some of these methods have been employed in the creation of a dataset of daily meteorological observations in China (Feng et al. 2004, Int. J. Climatol., 24, 853-870). In the same article I also present a technique which I have developed to find “discontinuities” in time series. This is an important issue with regard to the analysis of long series of radiosonde data because changes in instruments and recording practices have contaminated some of the data with these artificial discontinuities.
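To illustrate the general idea with a small sketch, the median and a scaled median absolute deviation (MAD) can stand in for the mean and standard deviation when outliers are present. These are generic resistant estimators chosen for brevity and are not necessarily the specific alternatives presented in the review article. The sample values, the function names, and the flagging threshold of 3.5 are assumptions for illustration.

```python
import statistics

def mad_scale(values):
    """Resistant spread estimate: scaled median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return 1.4826 * mad   # scaling makes it comparable to a standard deviation for Normal data

def flag_outliers(values, threshold=3.5):
    """Indices of values whose resistant z-score exceeds the threshold (threshold is an assumption)."""
    med = statistics.median(values)
    scale = mad_scale(values)
    return [i for i, v in enumerate(values) if abs(v - med) / scale > threshold]

# Hypothetical temperatures with one grossly bad value.
temps = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 99.9, 12.2, 11.7, 12.1]

mean_temp = statistics.fmean(temps)      # pulled far from the bulk of the data
median_temp = statistics.median(temps)   # barely affected by the bad value
std_temp = statistics.stdev(temps)       # inflated by the bad value
robust_std = mad_scale(temps)            # reflects the spread of the good values
flagged = flag_outliers(temps)
```

A single bad value is enough to distort the mean and standard deviation, while the median and MAD change very little, which is what makes them useful for quality control.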

**Spectral Analysis**

During the early 1990s I was introduced to a relatively new approach to spectral analysis known as Multitaper Spectral Analysis. This approach was devised by David Thomson of AT&T Bell Labs during the early 1980s and is clearly documented in the text by Percival and Walden (1993), “Spectral Analysis For Physical Applications: Multitaper and Conventional Univariate Techniques”. Traditional spectral analysis is performed by applying a single taper to the data, followed by Fourier transformation, which yields periodogram estimates that are then smoothed using a window. The tapering is aimed at reducing leakage, which leads to bias in the estimated spectrum. The windowing is aimed at reducing the variance of the estimated spectrum, but at the cost of reduced frequency resolution. By contrast, multitaper spectral analysis applies more than one taper, and the separate periodogram estimates (one from each taper) are then averaged instead of being smoothed with a window. The result is that for a given bandwidth the multitaper method generally produces a better combination of low bias and low variance than the traditional method or, conversely, for a given bias and variance the multitaper resolution is greater. In my future work I plan to utilize the multitaper approach when spectral and cross-spectral analysis is appropriate.
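The averaging step described above can be sketched as follows. This is a simplified illustration that uses an unweighted average of the eigenspectra rather than any adaptive weighting, and it computes the Slepian (DPSS) tapers from the standard symmetric tridiagonal eigenproblem. The function names, the time-bandwidth product of 4 with 7 tapers, and the synthetic series are assumptions for illustration.

```python
import numpy as np

def dpss_tapers(n, nw, k):
    """First k Slepian (DPSS) tapers of length n for time-bandwidth product nw."""
    w = nw / n                                            # half-bandwidth in cycles per sample
    idx = np.arange(n)
    diag = ((n - 1 - 2 * idx) / 2.0) ** 2 * np.cos(2 * np.pi * w)
    off = idx[1:] * (n - idx[1:]) / 2.0
    tri = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
    _, vecs = np.linalg.eigh(tri)                         # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k].T                         # rows: tapers for the k largest eigenvalues

def multitaper_psd(x, nw=4.0, k=7, dt=1.0):
    """Multitaper spectrum estimate: simple average of the k eigenspectra."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    tapers = dpss_tapers(len(x), nw, k)                   # each taper has unit energy
    eigenspectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    psd = dt * eigenspectra.mean(axis=0)                  # average instead of windowing
    freqs = np.fft.rfftfreq(len(x), d=dt)
    return freqs, psd

# Synthetic example (hypothetical data): a sinusoid at 0.1 cycles per sample in white noise.
rng = np.random.default_rng(0)
n = 512
t = np.arange(n)
series = 2.0 * np.sin(2 * np.pi * 0.1 * t) + rng.standard_normal(n)

freqs, psd = multitaper_psd(series, nw=4.0, k=7)
peak_freq = freqs[np.argmax(psd)]
```

The spectral peak of the sinusoid appears as a plateau roughly `2 * nw / n` wide, which is the resolution given up in exchange for the reduced variance obtained by averaging over tapers.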

**Other**

In the course of studying climate variability and change it is sometimes necessary to deal with statistical issues that arise. In one instance we developed an approach to studying the nature of climate change by comparing linear and nonlinear measures of change (Seidel and Lanzante, 2004). In another instance I was motivated by comments made during the course of a climate workshop. Subsequent examination of the recent literature confirmed my suspicion that there is some widespread misunderstanding regarding the use of “error bars” in the analysis of climate data. This was pointed out, along with some simple illustrative examples, in a brief correspondence (Lanzante 2005).

**Plans**

I continue my regular perusal of a number of statistical journals for the latest developments. With my new focus on statistical downscaling, I have been expanding my portfolio. In addition to methods directly related to statistical downscaling, I am exploring auxiliary techniques, particularly those pertinent to distributions and extreme values.
