Three Species of Iris Flowers

Fisher (1936) used this data to illustrate his method of linear discriminant analysis. The data was collected in the Gaspé in Quebec and comprises 50 four-variate measurements on each of 3 Iris species (setosa, versicolor, virginica), for a total of 150 observations. The four measurements are the length and width of the sepal and the length and width of the petal of each flower.

Iris Petal and Sepal

This dataset is available in the datasets package that comes with R, so it is one of the “built-in” datasets. Here are the first four rows:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
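The listing above can be reproduced with a call along the following lines. This is a minimal sketch that assumes a standard R installation, where the datasets package is attached by default; the name `first_rows` is only illustrative.

```r
# iris ships in the datasets package, which is attached by default
data(iris)

# First four rows, matching the listing above
first_rows <- head(iris, 4)
print(first_rows)

# 150 observations: 4 measurements plus the species label
dim(iris)
table(iris$Species)
```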

The lattice boxplot display in Figure 1 summarizes the distribution of each variable for each Iris species.

Notice that simply using petal length (PL) we can discriminate perfectly between setosa and the other two species. The differences in PL between versicolor and virginica are statistically strong, but for these two species the distributions of all four measurements overlap slightly, so a simple discrimination method is not evident.
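One way to check the claim about petal length numerically is to compare the range of PL within each species. The sketch below uses only base R; the helper names `pl_range`, `setosa_separated` and `vv_overlap` are introduced here for illustration.

```r
# Range of petal length (PL) within each species: a 2 x 3 matrix
pl_range <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(pl_range) <- c("min", "max")
print(pl_range)

# setosa is perfectly separated if its largest PL is below
# the smallest PL observed in the other two species
setosa_separated <- pl_range["max", "setosa"] <
  min(pl_range["min", c("versicolor", "virginica")])

# versicolor and virginica overlap if their PL ranges intersect
vv_overlap <- pl_range["max", "versicolor"] >= pl_range["min", "virginica"]

c(setosa_separated = setosa_separated, vv_overlap = vv_overlap)
```

A range comparison is a cruder summary than the boxplots, but it is enough to confirm perfect separation or the lack of it.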

Also notice that the sepal and petal measurements differ greatly in magnitude even though the unit of measurement is the same. We should definitely rescale these measurements (center each one and divide by a scale estimate) when doing principal component analysis.
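The difference in magnitudes, and the effect of rescaling, can be seen with a short sketch like the one below. It assumes the standard deviation as the scale estimate, which is what `scale()` uses by default; a robust estimate such as `mad()` would be another reasonable choice. The names `x` and `z` are illustrative.

```r
x <- iris[, 1:4]

# Means and standard deviations on the original measurement scale
print(round(rbind(mean = colMeans(x), sd = sapply(x, sd)), 2))

# Center each column and divide by its standard deviation
z <- scale(x, center = TRUE, scale = TRUE)

# After rescaling, every column has mean 0 and standard deviation 1
print(round(rbind(mean = colMeans(z), sd = apply(z, 2, sd)), 2))
```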

Figure 1. Lattice Boxplots of Fisher’s Iris Data
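A display of this kind can be produced with `bwplot()` from the lattice package. The sketch below is one possible layout, not necessarily the one used for Figure 1: it assumes one panel per variable with a separate y-axis scale in each, and the names `long` and `fig1` are illustrative.

```r
library(lattice)

# Reshape to long format: one row per (flower, variable) pair
long <- data.frame(
  Species = rep(iris$Species, times = 4),
  stack(iris[, 1:4])  # adds columns `values` and `ind` (the variable name)
)

# One panel per variable, one box per species, free y-scale in each panel
fig1 <- bwplot(values ~ Species | ind, data = long,
               scales = list(y = list(relation = "free")),
               ylab = "Measurement")
print(fig1)
```

Freeing the y-scale keeps the much smaller petal width values from being squashed by the sepal length axis.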

Figure 2 shows the boxplots of the principal components (PCs). Since there are 4 variables, we have 4 principal components. As can be seen from the boxplots, the first PC has the largest variance while the last has the smallest.
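The components and their variances can be computed with `prcomp()` from the stats package. This is a minimal sketch that assumes centering and scaling by the standard deviation; the original analysis may have used a different scale estimate. The names `pca` and `pc_var` are illustrative.

```r
# PCA on centered and scaled measurements
# (scaling by the standard deviation is an assumption)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# One component per original variable
ncol(pca$x)

# Component variances, returned in decreasing order
pc_var <- pca$sdev^2
print(round(pc_var, 3))

# Proportion of total variance carried by each component
print(round(pc_var / sum(pc_var), 3))

# Boxplots of the component scores
boxplot(as.data.frame(pca$x), ylab = "Score")
```

Because the variables were scaled to unit variance, the component variances sum to the number of variables.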

Figure 3 shows the lower triangle of a scatterplot matrix of the PCs. The points have been coloured to indicate the species. The diagonal panels are histograms of the PCs. The display clearly reveals that the first PC perfectly discriminates between setosa and the other species.

Figure 3. Scatterplot Matrix, Principal Components, Iris Data
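A similar display can be sketched with base R's `pairs()`. The colours, the scaling used in `prcomp()`, and the helper `panel_hist` are assumptions made for illustration, not details taken from the original figure.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
scores <- pca$x

# Diagonal panel: histogram of one PC, rescaled to fit inside the panel
panel_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  nb <- length(h$breaks)
  rect(h$breaks[-nb], 0, h$breaks[-1], h$counts / max(h$counts), col = "grey")
}

# One colour per species (setosa, versicolor, virginica)
species_col <- c("red", "darkgreen", "blue")[as.integer(iris$Species)]

# Lower triangle only, with histograms on the diagonal
pairs(scores, col = species_col, pch = 19,
      upper.panel = NULL, diag.panel = panel_hist)

# Numerical check that PC1 alone separates setosa from the rest;
# both directions are tested because the sign of a PC is arbitrary
pc1 <- split(scores[, "PC1"], iris$Species == "setosa")
separated <- max(pc1[["TRUE"]]) < min(pc1[["FALSE"]]) ||
  min(pc1[["TRUE"]]) > max(pc1[["FALSE"]])
print(separated)
```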