These instructions are written for R version 1.4 for Windows, but much of what we do will work in all versions on all platforms.
R is free software. You can download it from the web and install it on your own computer (see the instructions below). It is also available on all of the PCs in the Statistics Computer Lab (WSC 256), with TAs to help on weekdays (from 10 AM to 6 PM, starting September 18).
A similar commercial package called S-PLUS is also available. Student versions of S-PLUS cost around $100. If you have one of those, it should be sufficient for this course; if not, I'd suggest sticking with R.
You can download R from the web; start at http://cran.r-project.org. Choose your operating system from the list, e.g. R for Windows. Keep following the links until you get to the base directory, where you will see a list of files. Download SetupR.exe (approximately 19 megabytes). Run it, and follow the instructions to install R in the directory of your choice.
You don't need them now, but later you may want to download contributed packages for R. These contain statistical methods that aren't in the base package. To download one, run R and click on Package|Install package from CRAN.
To start R:
R : Copyright 2001, The R Development Core Team
Version 1.4.0 (2001-12-19)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type `license()' or `licence()' for distribution details.
R is a collaborative project with many contributors.
Type `contributors()' for more information.
Type `demo()' for some demos, `help()' for on-line help, or
`help.start()' for a HTML browser interface to help.
Type `q()' to quit R.
>
This is the R console window; everything you do in R will be typed in this window.
There are several ways to shut it down:
R remembers objects between sessions, so if you create a dataset and then have to quit, it will still be there the next time you sign on.
If you know the name of an R function, you can get help on it through the Help menu or by typing ?name in the console window. Warning: it takes a while to get used to the help files; they often give more detail than you want, without answering your question!
Everything that you work with in R is an object of some sort. The most common kinds of object are:
To create a scalar, just assign a numerical value to a variable. For example,
x <- 5
y <- x + 3
Note that assignment is done with <-. You may also use the underscore, e.g. x _ 5, to do the same thing; the underscore is sometimes easier to type, but can be confusing.
To create a vector, assign a vector value to a variable. The c(...) function is the simplest way to create a vector value:
x <- c(1,2,3,4,5)
y <- x + 3
y[3] <- 1 # Assign a value to the third element
Almost all arithmetic operations can be done on vectors as well as scalars. They simply operate on one element at a time. You can mix vectors and scalars in expressions; R acts just as though the scalar were repeated for every element of the vector.
Watch out if you mix vectors of different lengths: you'll hardly ever get the result you want!
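As a rough illustration of the two points above, the sketch below combines a vector first with a scalar and then with a shorter vector. The name short is made up for this example. R reuses ("recycles") the shorter vector to fill out the longer one, and prints a warning when the lengths don't divide evenly.

```r
x <- c(1, 2, 3, 4, 5)

# A scalar behaves as if it were repeated for every element.
x + 10            # 11 12 13 14 15

# A shorter vector is recycled as 10, 20, 10, 20, 10.
# R also prints a warning, because 5 is not a multiple of 2.
short <- c(10, 20)
z <- x + short
z                 # 11 22 13 24 15
```

The second result is legal R, but it is rarely what was intended, which is why the warning is worth reading.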
To see the contents of an object, just type its name on a line by itself, and the current value will be printed.
x <- c('red','green','blue')
y <- c(x,'yellow','magenta','cyan')
Use the data.frame() function to construct dataframes from vectors:
x <- c('red','green','blue')
y <- c(1,2,3)
d <- data.frame(x,y,z=c(4,5,6))
Note that you use = to set the value of the argument of a function,
rather than <- which assigns a value to a variable.
names(d) # Print the names of the objects in the d dataframe.
d$x # Print the x component of d
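Here is a small sketch of working with a dataframe once it exists. The column names colour, count and weight are invented for the illustration; the point is only that each column can be pulled out with $ and then used like any other vector.

```r
colour <- c('red', 'green', 'blue')
count  <- c(1, 2, 3)
d <- data.frame(colour, count, weight = c(4, 5, 6))

names(d)             # "colour" "count" "weight"
nrow(d)              # 3 rows, one per element of the vectors
d$count + d$weight   # columns are ordinary vectors: 5 7 9
d[2, ]               # the second row, with all three columns
```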
x <- c(1,2,3)
y <- c(4,5,6)
plot(x,y,main='My title')
R has a somewhat crazy scheme for handling abbreviations of argument names; avoid it! Type each argument name in full, or risk getting very, very confused.
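To make the warning about abbreviations concrete, here is a minimal sketch using a made-up function, scale_by, with made-up argument names values and factor. R accepts any unambiguous prefix of an argument name, so both calls give the same answer, but only the first one is easy to read later.

```r
# A hypothetical function, defined only for this illustration.
scale_by <- function(values, factor = 1) values * factor

# Argument names typed in full: clear and safe.
full <- scale_by(values = c(1, 2, 3), factor = 2)

# Abbreviated names: R matches 'val' to 'values' and 'fac' to 'factor'.
abbreviated <- scale_by(val = c(1, 2, 3), fac = 2)

full          # 2 4 6
abbreviated   # 2 4 6
```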
You can use assignments like
x <- c(1,2,3)
to create small datasets. However, this quickly gets cumbersome. The best way to enter a larger dataset is to use an external editor (like Notepad) or a spreadsheet. What you want to create is a "comma-separated values" (CSV) file, with titles at the top of each column.
For example, here are a few lines from a dataset on fuel consumption of my car:
Km, Day, Month, Year, Fill, Cost, Litres, City
530.5, 6, 7, 1997, 1, 18.5, 32.5, 0
838, 19, 7, 1997, 1, 13.25, 23.7, 0
1288, 19, 7, 1997, 1, 19, 32.3, 0
1800, 24, 7, 1997, 1, 19.5, 32.8, 0
If I had entered these into a file called C:\TEMP\CAR.CSV, I could load them into a variable called car like this:
car <- read.csv('C:\\TEMP\\CAR.CSV')
The read.table function can handle other formats; see
?read.table for the details.
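If you want to try this without creating the file by hand, the sketch below writes the sample rows to a temporary file and reads them back. The temporary file stands in for C:\TEMP\CAR.CSV, and the name csv_file and the strip.white = TRUE argument are choices made for this example (the latter drops the blanks after each comma).

```r
# Write the sample rows to a temporary file (a stand-in for C:\TEMP\CAR.CSV).
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Km, Day, Month, Year, Fill, Cost, Litres, City",
  "530.5, 6, 7, 1997, 1, 18.5, 32.5, 0",
  "838, 19, 7, 1997, 1, 13.25, 23.7, 0",
  "1288, 19, 7, 1997, 1, 19, 32.3, 0",
  "1800, 24, 7, 1997, 1, 19.5, 32.8, 0"
), csv_file)

# Read the file into a dataframe; the first line supplies the column names.
car <- read.csv(csv_file, strip.white = TRUE)

names(car)               # the column titles from the first line
car$Cost / car$Litres    # cost per litre at each fill
diff(car$Km)             # distance driven between fills
```

Once the data are in a dataframe, each column is available as car$Km, car$Cost and so on.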
R is often used as a sophisticated desk calculator. Remember that most operations can be done on whole vectors at once. There are also functions to calculate statistics from vectors. Some useful functions are used below:
x <- 1:10    # The numbers 1 to 10
mean(x)      # The sample mean
var(x)       # The sample variance
sd(x)        # The sample standard deviation
y <- 11:20   # The numbers 11 to 20
var(x,y)     # The sample covariance
cor(x,y)     # The sample correlation
median(x)    # The sample median
summary(x)   # Several useful statistics
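Because operations work on whole vectors, the usual textbook formulas can be typed almost as written. The sketch below recomputes the mean, variance and standard deviation by hand and compares them with the built-in functions; the names n, xbar, s2 and s are just labels chosen for this example.

```r
x <- 1:10
n <- length(x)

xbar <- sum(x) / n                   # sample mean: 5.5
s2   <- sum((x - xbar)^2) / (n - 1)  # sample variance, dividing by n - 1
s    <- sqrt(s2)                     # sample standard deviation

all.equal(xbar, mean(x))   # TRUE
all.equal(s2, var(x))      # TRUE
all.equal(s, sd(x))        # TRUE
```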
There are quite a few built-in functions for generating random numbers and working with their distributions:
x <- rnorm(100)    # Generate a vector of 100 Normal random values
plot(x,dnorm(x))   # Plot them against their density function
y <- runif(100)    # ... and a vector of 100 Uniform random values
plot(y,dunif(y))   # plotted against their density function.
Other built-in distributions include Student's t, Chi-square, gamma, F, lognormal, Poisson, binomial, ...
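The function names follow a pattern: a one-letter prefix says what you want, and the rest names the distribution. The sketch below shows the four prefixes for the Normal distribution (r for random values, d for the density, p for the cumulative probability, q for quantiles). The call to set.seed() is an addition for this example, so that the random values are the same each time it is run.

```r
set.seed(1)           # make the random draws repeatable

x <- rnorm(1000)      # r: 1000 random Normal values
dnorm(0)              # d: density at 0, about 0.399
pnorm(0)              # p: P(X <= 0), exactly 0.5
qnorm(0.975)          # q: the 97.5% quantile, about 1.96

# The same prefixes work for the other distributions.
counts <- rpois(5, lambda = 3)   # 5 random Poisson values with mean 3
qt(0.975, df = 10)               # 97.5% quantile of Student's t, 10 d.f.
```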
One place in which R excels is in graphing. It is very flexible, but at the same time, simple plots are fairly easy to do.
For analysis or presentation graphics, some commonly used functions are:
plot(x,y)            # A scatterplot
plot(x,y,type='l')   # As above, with the points joined by lines
hist(x)              # A histogram
stem(x)              # A couple of plots that
piechart(x)          # you should probably never use!
There is also a whole family of functions to add to a graph:
abline(a=3,b=4)       # Add the line y=a + b x to the plot
lines(x,y)            # Add lines joining the data points
points(x,y)           # Add points to the plot
text(x,y,labels=y)    # Add text to a plot
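These functions are meant to be used together: one call draws the graph, and later calls add to it. Here is a small sketch with made-up data scattered around the line y = 3 + 4x; the data values and the title are invented for the illustration.

```r
x <- 1:10
# Made-up data: points near the line y = 3 + 4x.
y <- 3 + 4 * x + c(-2, 1, 0, 2, -1, 1, -2, 0, 2, -1)

plot(x, y, main = 'Made-up data')   # start with a scatterplot
abline(a = 3, b = 4)                # add the line y = 3 + 4x
lines(x, y)                         # join the data points
text(x, y, labels = x, pos = 3)     # label each point just above it
```

Each of the last three calls draws on top of the existing plot; calling plot() again would start a new one.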
To print a graph, select it on the screen, then use the File|Print menu selection. You can choose the dot matrix printers or the laser printer. Ask the TA in the Lab for instructions on how to use each. You can also save graphs in various file formats to be printed later, or to put on a web page.
R maintains a "history" of past commands that you have executed. You can retrieve these by hitting the up arrow on the keyboard in the command window. Once you've retrieved a previous command, you can edit it and hit Enter again to execute the changed version.
When you get to more than a few lines of code, it's a good idea to use a separate editor window to edit your code. For example, open Notepad beside your R window, and type your commands there. When you've got them right and want to execute them, there are two ways to proceed.
source('C:\\temp\\script.r',echo=TRUE)
in your R command window.
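To see what source() does without setting up a file by hand, you can try the sketch below. It writes a three-line script to a temporary file (standing in for C:\temp\script.r) and then runs it; the name script and the contents of the file are made up for this example.

```r
# Create a small script file (a stand-in for C:\temp\script.r).
script <- tempfile(fileext = ".r")
writeLines(c(
  "x <- 1:10",
  "m <- mean(x)",
  "m"
), script)

# Run it; echo=TRUE prints each command before its result.
source(script, echo = TRUE)

# Objects created by the script are available afterwards.
m   # 5.5
```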
If your script contains more than one plot, the first ones will be displayed and lost right away. To avoid this, go to the graph window, and click on History|Recording. Then all plots will be saved, and you can use the PageUp and PageDown keys to switch between them.
Eventually, you'll find that you are repeating the same code. At that point it's best to write a function, so that you only need to type the complicated procedure once, but can use it many times.
Functions have three parts: a header, a body, and a return value. The header tells R how your function expects to be called. The body defines what the function will do. The return value (the value calculated in the last line of the body to be executed, or whatever you pass to return()) is what the system sees after your function has executed.
For example, to calculate the mean of all values except the biggest and smallest in a vector (a "trimmed mean"), you could use the following function:
trimmedmean <- function(x, trim=1)
{
  x <- sort(x)                      # sort into increasing order
  x <- x[-(1:trim)]                 # delete the smallest values
  x <- x[-(length(x)+1-(1:trim))]   # delete the largest values
  mean(x)
}
You can call this and see the results below:
> x <- rt(10,1)
> x
 [1]  0.12159140  0.37272247 -0.02500885 -1.56863046
 [5] -0.15048060  0.11176227  0.15118459 -1.54924705
 [9]  0.05186935 -0.89555572
> mean(x)
[1] -0.3379793
> trimmedmean(x)
[1] -0.2729856
> trimmedmean(x,2)
[1] -0.1309704
> x
 [1]  0.12159140  0.37272247 -0.02500885 -1.56863046
 [5] -0.15048060  0.11176227  0.15118459 -1.54924705
 [9]  0.05186935 -0.89555572
>
Note that the parameter trim has a default value of 1, but can be specified to be 2 instead. Also note that x wasn't changed by the trimmedmean function: when you pass an argument to a function, it only gets a copy, so any changes it makes don't affect the original object.
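Since rt() gives different values every time, it can be easier to check the function on a small vector where the answer is obvious. The sketch below repeats the definition of trimmedmean so that it runs on its own; the vector v is made up for this example.

```r
trimmedmean <- function(x, trim=1)
{
  x <- sort(x)                      # sort into increasing order
  x <- x[-(1:trim)]                 # delete the smallest values
  x <- x[-(length(x)+1-(1:trim))]   # delete the largest values
  mean(x)
}

v <- c(10, 1, 2, 3, 4, 100, -50)

mean(v)                    # 10: pulled around by -50 and 100
trimmedmean(v)             # 4: the mean of 1, 2, 3, 4, 10
trimmedmean(v, trim = 2)   # 3: the mean of 2, 3, 4
v                          # still in its original, unsorted order
```

The last line shows the copying behaviour again: the function sorted and shortened its own copy, while v itself is untouched.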