Quantitative Programming Environments

A.I. McLeod, Revised: December 7, 2000
© A.I. McLeod,  2000; aim@uwo.ca


Computational Science

Charles Van Loan (1996) has made the point that there are three major components to Science (see diagram).  Over past centuries, Science was dominated by its Theoretical and Experimental sides because our computational powers were relatively feeble.  Today, with the advances in computer technology, Computational Science is becoming of equal importance.

[Graphics:Images/qpe_gr_1.gif]

Both Theoretical and Experimental Science have long traditions and high standards.  However, the rapid growth of research in Computational Science has sometimes not been of the same high quality that is present in Theoretical and Experimental Science.  Charles Van Loan (1996) makes an analogy with the automobile: it is well known that nice people sometimes turn not so nice when they are behind the wheel of this powerful machine.  For those of us participating in the Computational Science endeavour, we must strive to attain the highest possible standards.

Reproducible Research

Many years ago, Hoaglin & Andrews (1975) discussed problems apparent in the reporting of investigations.  Their essential point was that many of the results reported in articles in refereed journals were essentially not reproducible.  Some progress has been made on this front in the journals published by the American Statistical Association (ASA), such as the Journal of the American Statistical Association (JASA).  Unless sound reasons are given not to, all datasets appearing in the journals must now be made available electronically (JASA Datasets), and specialized software developed by the authors is also published (JASA Software).  Here are some guidelines from the JASA Homepage for Authors:

Data—Whenever a data set is used, its source should be fully documented. When it is not practical to include the whole of a data set in the paper, the paper should state how the complete data set can be obtained. Unless conditions of security or confidentiality intervene, availability of the data on which the paper is based is a requirement for publication.

Results Based on Computation—Papers reporting results based on computation should provide enough information so that readers can evaluate the quality of the results. Such information includes estimated accuracy of results, as well as descriptions of pseudorandom-number generators, numerical algorithms, computers, programming languages, and major software components that were used.

Not all good journals have the same standards as JASA.  For example, a researcher publishing statistical work in a business journal told me that the data are never published, and that anyone requesting the data would be expected to make the author who provides it a co-author on their paper.  This type of situation seems to be rampant in many other fields, where the conclusions of a data analysis are the main result of the paper but the data are either unobtainable or very difficult for another researcher to reconstruct from the sources given.  This clearly is not science.  I believe that most of the conclusions based on statistical analysis in many of the application fields of statistics and data analysis are not warranted.

Dave Donoho, one of the most brilliant and active researchers in Statistics today, specializes in wavelets.  Buckheit & Donoho (1995) argue that much of the research in wavelets belongs to computational science, and that what is of interest in this situation is not a general theoretical description but the exact details of how things are done.  In their own words:

"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship.  The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures".

In their view it is scandalous for researchers to publish articles without sufficient information for the reader to reproduce and verify the results with ease.  So when the wavelet team led by Donoho and Johnstone at Stanford does research, they document everything they do with MatLab scripts, and they provide these scripts along with text versions of the articles on the web.  In their view, the quality and productivity of their research have been greatly enhanced by this approach.  A comprehensive MatLab toolkit for wavelet analysis, WaveLab, has been developed by this group.  This toolkit will be discussed at length in my next lecture.

In the view of Donoho and his colleagues at Stanford University, we in computational science must ensure that our research is reproducible by other researchers; anything less is scandalous and unacceptable.  This is a necessary condition for it to be called a Science.

Buckheit & Donoho (1995) discuss five personal anecdotes about the costs of not engaging in reproducible research: (i) Burning the Midnight Oil; (ii) The Stolen Briefcase; (iii) Who's on First?; (iv) A Year is a Long Time in this Business; (v) A la Recherche des Paramètres Perdus.

To quote Donoho et al. (2000) from the document About Wavelab:

We conduct our research with the idea, from the beginning, that we will implement our tools in WaveLab.  We believe that the discipline that this entails makes our research of higher quality than would otherwise be possible.

Free Software

I would also like to mention that there is a nearly universal consensus among the most sophisticated computer users that the current paradigm of commercial software development, as exemplified by Microsoft Corporation, often leads to poor quality software.  GNU and free software are often markedly better than commercial products.  The reasons for this are discussed in the fascinating book by Eric Raymond (1999).  The most important factor to consider is the very large number of expert programmers who do not attempt to earn a living by selling software; about 99% of all expert-level computer programmers fall into this category.  When a software system like Linux is made freely available, many talented programmers use it and develop it further.  I think there is a profound truth in Raymond's maxim: "Given enough eyeballs, all bugs are shallow".

Overview of QPEs

Modern advanced high-level QPEs that I am familiar with include Mathematica, Splus/R, MatLab/Octave, and Maple.  Older and less satisfactory QPEs include Excel and APL.  In addition, there are numerous specialized and lower-level programming environments, such as SAS, Stata, and Microsoft Visual Studio for Fortran, C, and Java.

By a high-level computing environment we mean one in which computing algorithms are specified in a way similar to how the researcher or mathematician might think of them.  Iverson (1980) received the prestigious Turing Award from the Association for Computing Machinery for his development of APL.  The central theme of that paper is that APL is really an improvement on mathematical notation: it is more precise, and it is executable on a computer.  Stephen Wolfram, the main developer of Mathematica, was very much influenced by these ideas, and he incorporated most of the APL-like ideas on functions and operators into Mathematica in a much more general way.  No other high-level computing environment today provides the range of features found in Mathematica, viz: technical typesetting and document preparation, sophisticated programming in a variety of styles (procedural like C, functional like APL, or Lisp-like pattern matching), function and data visualization capabilities, accurate numerics with floating-point, arbitrary-precision, or exact rational computation, and state-of-the-art symbolics.

It is interesting to note that the first spreadsheet, VisiCalc, was developed by APL programmers, Dan Bricklin and Bob Frankston; see http://www.danbricklin.com/log/ and http://www.bricklin.com/visicalc.htm.  Bill Gates has credited spreadsheets as the "killer application" that led to the widespread use of computers in the business environment.  Spreadsheets encourage column and vector thinking, just as APL did.  Such vectorized thinking leads to a simpler, more streamlined, and more mathematical approach to computation.  Vectorization is now common in most QPEs.
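To make the vectorization point concrete, here is a small sketch in R (any of the QPEs above would serve equally well) comparing an explicit loop with the equivalent one-line vectorized expression for the sample variance:

# Sketch (R): loop versus vectorized computation of the sample variance.
z <- rnorm(1000)
zbar <- mean(z)
# loop style
s <- 0
for (t in 1:length(z)) s <- s + (z[t] - zbar)^2
s/(length(z) - 1)
# vectorized style: a single expression, close to the mathematical notation
sum((z - zbar)^2)/(length(z) - 1)
var(z)    # built-in function agrees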

Another Iverson idea I would like to mention is that, for those of us involved in education and research, it is very important to understand precisely the computations we are doing.  Basically, if you understand something well, then you can explain it precisely.  In the old days, this meant going through the mathematical details.  Today, if we use a sufficiently advanced QPE, we should not worry too much about whether the specific algorithms we are interested in are available as canned functions: as students and educators, we should be able to specify them ourselves with ease.

The specialized lower-level programming environments do not qualify as QPEs, since the specification of computations is often much more complicated and it is very hard to organize your work to achieve reproducible research.  The less satisfactory QPEs are available on only a single platform or a limited number of platforms.  As for the one advantage of the lower-level environments, computational efficiency and speed, it is becoming less and less important.  As Buckheit & Donoho (1995) point out, algorithms coded in a high-level QPE, although slower by a factor of perhaps a few hundred, are still faster than the efficiently coded lower-level ones running on a state-of-the-art machine of 12 months ago!

Mathematica

I have no doubt that Mathematica is by far the best QPE in terms of quality.  Its main disadvantages are its price and the secretive nature of some of its algorithms.  Mathematica is available in our department's MS-Windows PC Lab (WSC Room 256, 50 licenses) and in our graduate lab running Linux (WSC 277, 6 licenses, arva.stats.uwo.ca).  However, you don't always need to ride in the Cadillac to get there!  I have provided a comprehensive introductory tutorial to Mathematica here: Tutorial Introduction to Mathematica (http://www.stats.uwo.ca/computing/Mathematica/tutorial98.nb), and links to some web resources for Mathematica: http://www.stats.uwo.ca/computing/Mathematica/default.htm.  There are numerous Mathematica packages available from Wolfram Inc. as well as from third parties, and users have also contributed numerous free packages.  Wavelet Explorer provides the basic tools for studying wavelets.

MatLab

MatLab has been developed by researchers at MathWorks Inc. under the leadership of the distinguished numerical analyst Cleve Moler.  Like Mathematica, MatLab is expensive.  MatLab was developed as a high-level QPE focusing on numerical linear algebra.  It features some of the APL-like vectorization ideas, but in a much more limited way than Mathematica or Splus/R.  MatLab has been described as the Fortran of the 1990s, since it is widely used by engineers as well as applied mathematical researchers.  MatLab, like Mathematica, does not have very strong support for the advanced modern statistical methods (generalized linear models, nonparametric regression, CART, etc.) that are widely available in other packages, but this can easily change.  MathWorks Inc. provides numerous add-on toolkits for MatLab covering, for example, Maple symbolics, wavelets, neural nets, NAG algorithms, document preparation, signal processing, statistics, and a C compiler.

MatLab is available at UWO in the General Purpose Computing Labs as well as on our Linux server (arva.stats.uwo.ca).  Some MatLab web resources are available here: http://www.stats.uwo.ca/computing/MatLab/default.htm.

We plan to acquire MatLab for our undergraduate PC Lab and to use MatLab in our Statistics 241b and Statistics 260a/b courses.  I am working on a web page for MatLab tutorials: http://www.stats.uwo.ca/computing/MatLab/mcleod.htm

A free GNU clone of MatLab, called Octave, is also available (http://www.octave.org); it is currently at Version 2.0.16 (January 18, 2000).  Octave is installed on arva.stats.uwo.ca.

Splus, S and R

The programming language S originated at Bell Labs circa 1985 and was distributed at little cost to universities and non-profit research institutes; by around 1990 there were about 10,000 sites where S was installed.  A private company, now called MathSoft, appeared on the scene in the early 1990s and, under license from Bell Labs (Lucent Technologies), distributed a commercial and enhanced version of S known as Splus.  The main enhancement was to provide executable binaries for a variety of platforms.  The principal developer of S was John Chambers, although many others have been involved.  The Association for Computing Machinery presented its 1998 Software System Award to John Chambers of Bell Labs for the design of the S system.  The ACM citation stated that "S has forever altered the way people analyze, visualize, and manipulate data.... S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers".

S is still under development by John Chambers, who released a new book and a new version of the program last year.

Researchers in New Zealand, Robert Gentleman & Ross Ihaka, developed a free GNU version of S nicknamed R.  Today the core R developers include John Chambers and many leading academics such as Brian Ripley.  R provides a high-quality QPE for many statistical researchers.

S/Splus and R provide a very sophisticated vectorized programming environment.  Several thousand core statistical and mathematical functions are available, and there is a large user base of contributed software.  S/Splus and R also feature special functions for dynamic data visualization and trellis multipanel displays.

Web resources, including tutorials, for S/Splus and R are available here: http://www.stats.uwo.ca/computing/Splus/default.htm  and  http://www.stats.uwo.ca/computing/R/default.htm

For wavelet analysis, MathSoft has developed a wavelet module which is perhaps the most extensive collection of wavelet functions for statistical analysis available.  This wavelet module is only available with Splus.  Splus 5 with this module is installed on arva.stats.uwo.ca.

Researchers at the University of Bristol (Nason & Silverman) have developed WaveThresh, which is free and available for S, Splus, and R.  WaveThresh provides only basic wavelet capabilities for 1d and 2d transforms.

Comparisons

[Comparison table: Images/qpe_gr_2.gif through qpe_gr_18.gif.  The table rates the six QPEs (column headers in the graphics) on ten features (row labels in the graphics), with entries such as yes, no, limited, minimal, good, best, excellent, and poor; the product and feature names appear only in the graphics.]

Programming Example: The Periodogram

The periodogram provides an asymptotically unbiased, though inconsistent, estimate of the spectral density function.  It is usually computed at the Fourier frequencies, $\lambda_k = k/n$, $k = 1, \ldots, \lfloor n/2 \rfloor$, where $n$ is the length of the series.

$$I(\lambda_k) = \frac{1}{n}\left|\sum_{t=1}^{n} z_t\, e^{-2\pi i \lambda_k t}\right|^2$$

Another interpretation of $I(\lambda_k)$ is that it is proportional to the coefficient of determination, $R^2$, of the regression of the data $z_1, \ldots, z_n$ on a sinusoid with frequency $\lambda_k$.
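This interpretation is easy to check numerically.  The following R sketch (the series, its length, and the chosen frequency are arbitrary illustrations) verifies that the regression sum of squares on a sinusoid at a Fourier frequency equals twice the periodogram ordinate, so that $I(\lambda_k)$ is proportional to $R^2$:

# Sketch (R): the regression sum of squares on a sinusoid at a Fourier
# frequency equals 2 I(lambda_k), so I(lambda_k) is proportional to R^2.
set.seed(1)
n <- 100
z <- rnorm(n)
k <- 7                                   # any k = 1, ..., floor(n/2) - 1
lambda <- k/n
tt <- 1:n
fit <- lm(z ~ cos(2*pi*lambda*tt) + sin(2*pi*lambda*tt))
I.k <- (Mod(fft(z))^2/n)[k + 1]          # periodogram ordinate at lambda
summary(fit)$r.squared * sum((z - mean(z))^2)/2    # equals I.k
I.k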

MatLab

function [x,y] = periodogram(z)
% Periodogram of the column vector z at the Fourier frequencies k/n,
% k = 1, ..., floor(n/2).
n = size(z,1);
ft = fft(z);
np = floor(n/2);
ft = ft(2:(np+1));       % drop the zero-frequency term
x = (1:np)'/n;           % Fourier frequencies
y = abs(ft).^2/n;        % squared modulus of the FFT, scaled by n

Mathematica

[Graphics:Images/qpe_gr_25.gif]

Splus

periodogram <- function(z) {
    # Periodogram |fft|^2/n at the Fourier frequencies k/n, k = 1, ..., n %/% 2
    n <- length(z)
    (Mod(fft(z))^2/n)[2:(n %/% 2 + 1)]
}
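The same function runs unchanged in R.  As a usage sketch (sunspot.year is a dataset shipped with R, and the series is mean-corrected before transforming), the well-known 11-year solar cycle shows up as the dominant peak:

# Sketch (R): periodogram of the mean-corrected annual sunspot numbers.
z <- as.vector(sunspot.year) - mean(sunspot.year)
n <- length(z)
I <- periodogram(z)
freq <- (1:(n %/% 2))/n                  # frequencies in cycles per year
plot(freq, I, type = "h", xlab = "frequency (cycles/year)",
     ylab = "periodogram")
1/freq[which.max(I)]                     # dominant period, roughly 11 years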

Dynamic Visualization

Point Cloud

The NGC spiral galaxy dataset is discussed in some detail in Cleveland (1993); the dataset has been published on StatLib and is also available here:

galaxy dataset

The data consist of 332 measurements of the velocity of the galaxy away from us at different locations.  The overall average velocity of recession is about 1600 km/sec, with variations of about 350 km/sec.  We want to determine the regions where the velocity is high, low, and in between.  (A static R stand-in for the pointcloud plot is sketched after the links below.)

Links:

MatLab script file for pointcloud plot

MatLab workspace for pointcloud
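For readers without MatLab, here is a minimal static stand-in in R.  The file name galaxy.dat and the column names east.west, north.south, and velocity are assumptions about how the downloaded data were saved, not the published layout:

# Sketch (R): static stand-in for the pointcloud plot, color-coding the
# velocity as low/medium/high over the measurement locations.
galaxy <- read.table("galaxy.dat", header = TRUE)   # hypothetical file
vel <- cut(galaxy$velocity, breaks = 3,
           labels = c("low", "medium", "high"))
plot(galaxy$east.west, galaxy$north.south, col = as.integer(vel),
     pch = 16, xlab = "east-west", ylab = "north-south")
legend("topright", legend = levels(vel), col = 1:3, pch = 16)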

Bivariate Normal Density Function

The bivariate normal density function with unit variances and correlation coefficient $\rho$ may be written

$$f(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}}\, \exp\left\{-\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)}\right\}$$

MatLab script files for visualizing this 3D surface are available here (a static R sketch follows the links):

3D Plot

and

3D Movie
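Since the MatLab scripts are linked rather than reproduced, here is a minimal sketch in R of the same static surface, assuming $\rho = 0.7$ and a 41-point grid:

# Sketch (R): surface plot of the bivariate normal density with unit
# variances and correlation rho.
rho <- 0.7
f <- function(x, y)
    exp(-(x^2 - 2*rho*x*y + y^2)/(2*(1 - rho^2)))/(2*pi*sqrt(1 - rho^2))
x <- seq(-3, 3, length = 41)
z <- outer(x, x, f)
persp(x, x, z, theta = 30, phi = 25,
      xlab = "x", ylab = "y", zlab = "density")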

Dynamic Data Analysis

Advanced software for dynamic data analysis is available in Splus with the brush and spin functions.  An even more comprehensive approach is provided by the software XGobi.

We illustrate dynamic data analysis with the iris dataset.
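As a static stand-in for brushing and spinning (the iris data frame ships with R; the colors simply code the three species), a scatterplot matrix shows the grouping that the dynamic tools let you explore interactively:

# Sketch (R): static scatterplot matrix of the four iris measurements,
# color-coded by species.
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 16)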

Animation with Mathematica

Mathematica notebook CLT Animation

Mathematica notebook Time Series Plot Animation

Symbolics: Maple and Mathematica Comparison

The mean-square error (mse) of an estimator $\hat\theta$ of a parameter $\theta$ is defined as $\mathrm{mse}(\hat\theta) = E\{(\hat\theta - \theta)^2\}$.  The relative efficiency of an estimator $\hat\theta_1$ vs. an estimator $\hat\theta_2$ is defined as $e = \mathrm{mse}(\hat\theta_2)/\mathrm{mse}(\hat\theta_1)$.  Barnard, Jenkins and Winsten (1962) suggested that in many small-sample situations $\mathrm{mse}(\check\theta) < \mathrm{mse}(\hat\theta)$, where $\hat\theta$ and $\check\theta$ denote respectively the maximum likelihood estimator (mle) and the mean likelihood estimator (mele).

We will now examine the performance of these two estimators in the estimation of the parameter $p$ in a sequence of $n$ Bernoulli trials, where $x$ is the observed number of successes and $p$ is the probability of success.  The probability function is

$$\Pr(X = x) = \binom{n}{x}\, p^x (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n.$$

So if $x$ successes are observed in $n$ trials, the likelihood function may be written $L(p) = p^x(1-p)^{n-x}$, and the mle may be derived by calculus: $\hat p = x/n$.  Using Mathematica it is easily shown that the mele of $p$ is $\check p = (x+1)/(n+2)$ and that the relative efficiency of the mele vs. the mle, $e = \mathrm{mse}(\hat p)/\mathrm{mse}(\check p)$, exceeds 1 provided:

$$4(n+1)\,p\,(1-p) > n\,(1-2p)^2.$$
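The original notebook renders these formulas as graphics, so the condition above is a reconstruction; as a check, it follows from standard bias-variance algebra in the notation introduced above:

\begin{align*}
\check p &= \frac{\int_0^1 p\,L(p)\,dp}{\int_0^1 L(p)\,dp}
          = \frac{B(x+2,\,n-x+1)}{B(x+1,\,n-x+1)} = \frac{x+1}{n+2},\\
\mathrm{mse}(\hat p) &= \mathrm{Var}(\hat p) = \frac{p(1-p)}{n},\\
\mathrm{mse}(\check p) &= \mathrm{Var}(\check p) + \mathrm{bias}(\check p)^2
  = \frac{np(1-p)}{(n+2)^2} + \left(\frac{1-2p}{n+2}\right)^2,\\
e = \frac{\mathrm{mse}(\hat p)}{\mathrm{mse}(\check p)} > 1
  &\iff (n+2)^2\,p(1-p) > n^2\,p(1-p) + n\,(1-2p)^2\\
  &\iff 4(n+1)\,p(1-p) > n\,(1-2p)^2.
\end{align*}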

As shown in the figure below, the mele (which is also the Bayes estimator of $p$ under a uniform prior) is more efficient over most of the range of $p$, but the relative efficiency tends to 1 as $n \to \infty$.

[Graphics:Images/qpe_gr_50.gif]

Mathematica Derivation of $\check p$

See

Mathematica Notebook

or HTML version of Mathematica Notebook

Maple Derivation of $\check p$

See: Maple Worksheet or HTML version of Maple Worksheet

References

Barnard, G.A., Jenkins, G.M. & Winsten, C.B. (1962), "Likelihood inference and time series", Journal of the Royal Statistical Society, Series A, 125, pp. 321-372.

Buckheit, J. & Donoho, D. (1995), "WaveLab and Reproducible Research", in Wavelets and Statistics, eds. A. Antoniadis & G. Oppenheim, Springer-Verlag, pp. 55-82.  Web Link: http://www-stat.stanford.edu/~donoho/Reports/index.html

Cleveland, W.S. (1993), Visualizing Data, Hobart Press.

Hoaglin, D.C. & Andrews, D.F. (1975), "The reporting of computation-based results in statistics", The American Statistician, 29, pp. 122-126.

Iverson, K. (1980), "Notation as a tool of thought", Communications of the Association for Computing Machinery, 23, pp. 444-465.

Raymond, Eric (1999), "The Cathedral & The Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", O'Reilly Computer Books.  Web Link: http://www.tuxedo.org/~esr/writings/cathedral-bazaar/

Van Loan, Charles (1996), "If Copernicus had a Computer", Special Expository Lecture (video), Society for Industrial and Applied Mathematics, Philadelphia, PA.
