Introduction

forcats <=> FactOR CATegorical variableS

For more historical context on factors, I recommend "stringsAsFactors: An unauthorized biography" by Roger Peng and "stringsAsFactors = <sigh>" by Thomas Lumley.

Remark: I think base::ordered() is mainly useful with R modelling functions such as aov(), and perhaps a few others.

All forcats functions begin with fct_.

ls("package:forcats") %>% str_subset(., "^fct_")
##  [1] "fct_anon"        "fct_c"           "fct_collapse"   
##  [4] "fct_count"       "fct_drop"        "fct_expand"     
##  [7] "fct_explicit_na" "fct_infreq"      "fct_inorder"    
## [10] "fct_lump"        "fct_other"       "fct_recode"     
## [13] "fct_relabel"     "fct_relevel"     "fct_reorder"    
## [16] "fct_reorder2"    "fct_rev"         "fct_shift"      
## [19] "fct_shuffle"     "fct_unify"       "fct_unique"

Simple motivating example

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record this variable has two problems:

  1. There are only twelve possible months, and there’s nothing saving you from typos:

    x2 <- c("Dec", "Apr", "Jam", "Mar")

  2. It doesn’t sort in a useful way:

    sort(x1)
    ## [1] "Apr" "Dec" "Jan" "Mar"

You can fix both of these problems with a factor. To create a factor you must start by creating a character vector of the valid levels:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Now you can create a factor:

y1 <- factor(x1, levels = month_levels)
y1
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

And any values not in the set will be silently converted to NA:

y2 <- factor(x2, levels = month_levels)
y2
## [1] Dec  Apr  <NA> Mar 
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
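
Because the conversion to NA is silent, it can help to check for invalid values before building the factor. A minimal base-R sketch, reusing x2 from above and assuming the built-in constant month.abb as the set of valid levels (bad_values and n_missing are names made up for this illustration):

```r
x2 <- c("Dec", "Apr", "Jam", "Mar")
month_levels <- month.abb  # built-in constant: "Jan", "Feb", ..., "Dec"

# Values that are not valid levels and would silently become NA
bad_values <- setdiff(x2, month_levels)
bad_values

# Count how many entries were lost in the conversion
y2 <- factor(x2, levels = month_levels)
n_missing <- sum(is.na(y2))
n_missing
```

If you would rather get a warning than check by hand, readr::parse_factor() is worth a look.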

If you omit the levels, they’ll be taken from the data in alphabetical order:

factor(x1)
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar

Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to unique(x), or after the fact, with fct_inorder():

base::unique(c(10,3,3,7,10))
## [1] 10  3  7
f1 <- factor(x1, levels = unique(x1))
f1
## [1] Dec Apr Jan Mar
## Levels: Dec Apr Jan Mar
x1 %>% factor()
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar
f2 <- x1 %>% factor() %>% fct_inorder()
f2
## [1] Dec Apr Jan Mar
## Levels: Dec Apr Jan Mar

If you ever need to access the set of valid levels directly, you can do so with levels():

levels(f2)
## [1] "Dec" "Apr" "Jan" "Mar"

General Social Survey: forcats::gss_cat

forcats::gss_cat: sample of data from the General Social Survey

gss_cat
## # A tibble: 21,483 x 9
##     year marital         age race  rincome  partyid  relig  denom  tvhours
##    <int> <fct>         <int> <fct> <fct>    <fct>    <fct>  <fct>    <int>
##  1  2000 Never married    26 White $8000 t~ Ind,nea~ Prote~ South~      12
##  2  2000 Divorced         48 White $8000 t~ Not str~ Prote~ Bapti~      NA
##  3  2000 Widowed          67 White Not app~ Indepen~ Prote~ No de~       2
##  4  2000 Never married    39 White Not app~ Ind,nea~ Ortho~ Not a~       4
##  5  2000 Divorced         25 White Not app~ Not str~ None   Not a~       1
##  6  2000 Married          25 White $20000 ~ Strong ~ Prote~ South~      NA
##  7  2000 Never married    36 White $25000 ~ Not str~ Chris~ Not a~       3
##  8  2000 Divorced         44 White $7000 t~ Ind,nea~ Prote~ Luthe~      NA
##  9  2000 Married          44 White $25000 ~ Not str~ Prote~ Other        0
## 10  2000 Married          47 White $25000 ~ Strong ~ Prote~ South~       3
## # ... with 21,473 more rows

Because the factors are stored in a tibble, their levels are not immediately visible. One way to see them is with count():

gss_cat %>%
  count(race)
## # A tibble: 3 x 2
##   race      n
##   <fct> <int>
## 1 Other  1959
## 2 Black  3129
## 3 White 16395
base::table(gss_cat$race)
## 
##          Other          Black          White Not applicable 
##           1959           3129          16395              0

Bar chart:

ggplot(gss_cat, aes(race)) +
  geom_bar()

By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:

ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

These levels represent valid values that simply did not occur in this dataset. Unfortunately, dplyr did not have a drop option when this was written; dplyr 0.8.0 and later add a .drop argument to count() and group_by().
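
A small sketch of keeping empty levels in a count, assuming dplyr 0.8.0 or later, where count() accepts .drop = FALSE. The four-row tibble is made up for illustration (toy and counts are not part of gss_cat):

```r
library(dplyr)

# Made-up data: "Not applicable" is a valid level that never occurs
toy <- tibble(
  race = factor(
    c("White", "Black", "White", "Other"),
    levels = c("Other", "Black", "White", "Not applicable")
  )
)

# .drop = FALSE keeps zero-count levels in the result
counts <- toy %>% count(race, .drop = FALSE)
counts
```

The result should have one row per level, with n = 0 for "Not applicable", which matches what base::table() reports.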

When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.

Distribution of rincome (reported income)

rincome_plot <-
  gss_cat %>%
  ggplot(aes(rincome)) +
  geom_bar()
rincome_plot

Put the labels at an angle.

rincome_plot +
  theme(axis.text.x = element_text(angle = 45))

Rotating the plot is better.

rincome_plot +
  coord_flip()

Ten most common religions.

gss_cat %>%
  count(relig) %>%
  arrange(-n) %>%
  head(10)
## # A tibble: 10 x 2
##    relig                       n
##    <fct>                   <int>
##  1 Protestant              10846
##  2 Catholic                 5124
##  3 None                     3523
##  4 Christian                 689
##  5 Jewish                    388
##  6 Other                     224
##  7 Buddhism                  147
##  8 Inter-nondenominational   109
##  9 Moslem/islam              104
## 10 Orthodox-christian         95

Modifying factor order:

It’s often useful to change the order of the factor levels in a visualisation. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

relig_summary
## # A tibble: 15 x 4
##    relig                     age tvhours     n
##    <fct>                   <dbl>   <dbl> <int>
##  1 No answer                49.5    2.72    93
##  2 Don't know               35.9    4.62    15
##  3 Inter-nondenominational  40.0    2.87   109
##  4 Native american          38.9    3.46    23
##  5 Christian                40.1    2.79   689
##  6 Orthodox-christian       50.4    2.42    95
##  7 Moslem/islam             37.6    2.44   104
##  8 Other eastern            45.9    1.67    32
##  9 Hinduism                 37.7    1.89    71
## 10 Buddhism                 44.7    2.38   147
## 11 Other                    41.0    2.73   224
## 12 None                     41.2    2.71  3523
## 13 Jewish                   52.4    2.52   388
## 14 Catholic                 46.9    2.96  5124
## 15 Protestant               49.9    3.15 10846
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

It is difficult to interpret this plot because there’s no overall pattern.

fct_reorder()

We can improve it by reordering the levels of relig using fct_reorder().

fct_reorder() takes three arguments:

  • f, the factor whose levels you want to modify.
  • x, a numeric vector that you want to use to reorder the levels.
  • Optionally, fun, a function that’s used if there are multiple values of x for each value of f. The default value is median.

(Current forcats releases name these arguments .f, .x, and .fun.)

ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.
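
To see what fct_reorder() does to the levels themselves, here is a minimal sketch on a made-up factor (f, x, by_median, and by_max are illustrative names). The summary function is passed by position so the call does not depend on whether the argument is spelled fun or .fun in your forcats version:

```r
library(forcats)

f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 3, 9, 2)

# Default summary is the median of x within each level:
# a = 6, b = 2, c = 5.5, so levels sort as b, c, a
by_median <- fct_reorder(f, x)
levels(by_median)

# Third argument by position: summarise with max instead
# a = 7, b = 3, c = 9, so levels sort as b, a, c
by_max <- fct_reorder(f, x, max)
levels(by_max)
```

Only the level order changes; the values stored in the factor stay the same.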

Suggestion: move transformations out of aes() and into a separate mutate() step.

relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()

What if we create a similar plot looking at how average age varies across reported income level?

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

rincome_summary 
## # A tibble: 16 x 4
##    rincome          age tvhours     n
##    <fct>          <dbl>   <dbl> <int>
##  1 No answer       45.5    2.90   183
##  2 Don't know      45.6    3.41   267
##  3 Refused         47.6    2.48   975
##  4 $25000 or more  44.2    2.23  7363
##  5 $20000 - 24999  41.5    2.78  1283
##  6 $15000 - 19999  40.0    2.91  1048
##  7 $10000 - 14999  41.1    3.02  1168
##  8 $8000 to 9999   41.1    3.15   340
##  9 $7000 to 7999   38.2    2.65   188
## 10 $6000 to 6999   40.3    3.17   215
## 11 $5000 to 5999   37.8    3.16   227
## 12 $4000 to 4999   38.9    3.15   226
## 13 $3000 to 3999   37.8    3.31   276
## 14 $1000 to 2999   34.5    3.00   395
## 15 Lt $1000        40.5    3.36   286
## 16 Not applicable  56.1    3.79  7043
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()

Principled order vs. arbitrary order

Here, arbitrarily reordering the levels isn’t a good idea! That’s because rincome already has a principled order that we shouldn’t mess with.

Reserve fct_reorder() for factors whose levels are arbitrarily ordered.

fct_relevel()

However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use fct_relevel().

fct_relevel(): takes a factor, f, and then any number of levels that you want to move to the front of the line.

levels(rincome_summary$rincome)
##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"
levels(fct_relevel(rincome_summary$rincome, "Not applicable"))
##  [1] "Not applicable" "No answer"      "Don't know"     "Refused"       
##  [5] "$25000 or more" "$20000 - 24999" "$15000 - 19999" "$10000 - 14999"
##  [9] "$8000 to 9999"  "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999" 
## [13] "$4000 to 4999"  "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
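
fct_relevel() can also move levels somewhere other than the front through its after argument. A minimal sketch on a made-up factor (f, to_front, and to_back are illustrative names):

```r
library(forcats)

f <- factor(
  c("low", "mid", "high", "n/a"),
  levels = c("low", "mid", "high", "n/a")
)

# Default: the named level moves to the front
to_front <- fct_relevel(f, "n/a")
levels(to_front)  # "n/a" "low" "mid" "high"

# after = Inf moves the named level to the end instead
to_back <- fct_relevel(f, "low", after = Inf)
levels(to_back)   # "mid" "high" "n/a" "low"
```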

fct_reorder2()

fct_reorder2() is useful when you are colouring the lines on a plot.

fct_reorder2() reorders the factor by the y values associated with the largest x values.

This makes the plot easier to read because the line colours line up with the legend.
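
A toy illustration of that rule, using made-up data rather than gss_cat (toy and reordered are illustrative names). At the largest x (3), the y values are a = 3, b = 1, c = 6, so with the default settings the levels should come out in the order c, a, b, which is the top-to-bottom order of the line ends:

```r
library(forcats)

toy <- data.frame(
  x = rep(1:3, times = 3),
  g = factor(rep(c("a", "b", "c"), each = 3)),
  y = c(1, 2, 3,   9, 5, 1,   4, 4, 6)
)

# Order the levels of g by the y value at the largest x, descending
reordered <- fct_reorder2(toy$g, toy$x, toy$y)
levels(reordered)
```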

#Correction A. I. McLeod, Nov 19, 2018
#PITFALL: count() on a grouped tibble keeps every grouping variable, so each
#(age, marital) group has a single row and prop = n / sum(n) is always 1.
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count() %>% #result is still grouped by age AND marital
  mutate(prop = n / sum(n)) 
by_age
## # A tibble: 351 x 4
## # Groups:   age, marital [351]
##      age marital           n  prop
##    <int> <fct>         <int> <dbl>
##  1    18 Never married    89    1.
##  2    18 Married           2    1.
##  3    19 Never married   234    1.
##  4    19 Divorced          3    1.
##  5    19 Widowed           1    1.
##  6    19 Married          11    1.
##  7    20 Never married   227    1.
##  8    20 Separated         1    1.
##  9    20 Divorced          2    1.
## 10    20 Married          21    1.
## # ... with 341 more rows
#CORRECTION: summarize() drops the last grouping variable (marital), so sum(n) is taken within each age
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  summarize(n=n()) %>%
  mutate(prop =  n/sum(n))
by_age
## # A tibble: 351 x 4
## # Groups:   age [72]
##      age marital           n    prop
##    <int> <fct>         <int>   <dbl>
##  1    18 Never married    89 0.978  
##  2    18 Married           2 0.0220 
##  3    19 Never married   234 0.940  
##  4    19 Divorced          3 0.0120 
##  5    19 Widowed           1 0.00402
##  6    19 Married          11 0.0442 
##  7    20 Never married   227 0.904  
##  8    20 Separated         1 0.00398
##  9    20 Divorced          2 0.00797
## 10    20 Married          21 0.0837 
## # ... with 341 more rows
ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE) +
  ggtitle("colours DO NOT line up with legend")

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line(na.rm = TRUE) +
  labs(colour = "marital") +
  ggtitle("colours DO line up with legend")

#Better coding style
by_age %>%
  mutate(marital = fct_reorder2(marital, age, prop)) %>%
  ggplot(aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE) +
  labs(colour = "marital") +
  ggtitle("colours DO line up with legend. Better coding style.")
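
An alternative way to get within-age proportions is to pass both variables to count() directly and group by age afterwards, which avoids relying on how many grouping levels a previous step left behind. A sketch on made-up data, assuming dplyr is installed (toy, g, h, and props are illustrative names):

```r
library(dplyr)

toy <- tibble(
  g = c("x", "x", "x", "y"),
  h = c("a", "a", "b", "a")
)

props <- toy %>%
  count(g, h) %>%              # one row per (g, h) pair, result is ungrouped
  group_by(g) %>%              # proportions are taken within each g
  mutate(prop = n / sum(n)) %>%
  ungroup()
props
```

Within g = "x" the proportions are 2/3 and 1/3, and the single row for g = "y" gets 1.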

fct_infreq() for Pareto chart-like ordering

For bar plots, you can use fct_infreq() to order levels by decreasing frequency (most common first): this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with fct_rev() if you want increasing frequency instead.

gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
    geom_bar()

#My preference
gss_cat %>%
  mutate(marital = marital %>% fct_infreq()) %>% 
  ggplot(aes(marital)) +
    geom_bar() +
  ggtitle("I prefer Pareto-like style")
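
The direction of the ordering is easy to check on a made-up factor (f, most_first, and least_first are illustrative names). Here the counts are c = 3, b = 2, a = 1:

```r
library(forcats)

f <- factor(c("b", "a", "c", "c", "b", "c"))

# Most frequent level first
most_first <- fct_infreq(f)
levels(most_first)   # "c" "b" "a"

# Reverse to get the least frequent level first
least_first <- fct_rev(fct_infreq(f))
levels(least_first)  # "a" "b" "c"
```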

Exploring gss_cat factors: arbitrary vs. principled

# keep() retains only the columns for which is.factor() is TRUE
gss_cat %>% purrr::keep(is.factor) %>% names()
## [1] "marital" "race"    "rincome" "partyid" "relig"   "denom"

There are six categorical variables: marital, race, rincome, partyid, relig, denom.

There is no obvious pattern to the ordering of the marital levels.

gss_cat %>%
  ggplot(aes(x = marital)) +
  geom_bar()

The ordering of race is principled in that the categories are ordered by count of observations in the data.

gss_cat %>%
  ggplot(aes(race)) +
    geom_bar()

Modifying factor levels

More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays.

fct_recode()

The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level.

For example, take the gss_cat$partyid:

gss_cat %>% count(partyid)
## # A tibble: 10 x 2
##    partyid                n
##    <fct>              <int>
##  1 No answer            154
##  2 Don't know             1
##  3 Other party          393
##  4 Strong republican   2314
##  5 Not str republican  3032
##  6 Ind,near rep        1791
##  7 Independent         4119
##  8 Ind,near dem        2499
##  9 Not str democrat    3690
## 10 Strong democrat     3490

The levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction.

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
## # A tibble: 10 x 2
##    partyid                   n
##    <fct>                 <int>
##  1 No answer               154
##  2 Don't know                1
##  3 Other party             393
##  4 Republican, strong     2314
##  5 Republican, weak       3032
##  6 Independent, near rep  1791
##  7 Independent            4119
##  8 Independent, near dem  2499
##  9 Democrat, weak         3690
## 10 Democrat, strong       3490

fct_recode() will leave levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.
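
Both behaviours can be checked on a made-up factor. In this sketch f, recoded, and warned are illustrative names, and the warning is captured with tryCatch() so the script keeps running:

```r
library(forcats)

f <- factor(c("apple", "bear", "banana"))

# Only "apple" is mentioned; "banana" and "bear" are left as is
recoded <- fct_recode(f, fruit_apple = "apple")
levels(recoded)

# "dog" is not a level of f, so fct_recode() should warn
warned <- tryCatch(
  {
    fct_recode(f, animal = "dog")
    FALSE
  },
  warning = function(w) TRUE
)
warned
```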

To combine groups, you can assign multiple old levels to the same new level:

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
## # A tibble: 8 x 2
##   partyid                   n
##   <fct>                 <int>
## 1 Other                   548
## 2 Republican, strong     2314
## 3 Republican, weak       3032
## 4 Independent, near rep  1791
## 5 Independent            4119
## 6 Independent, near dem  2499
## 7 Democrat, weak         3690
## 8 Democrat, strong       3490

You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.

fct_collapse()

If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode().

For each new level, you can provide a vector of old levels:

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
## # A tibble: 4 x 2
##   partyid     n
##   <fct>   <int>
## 1 other     548
## 2 rep      5346
## 3 ind      8409
## 4 dem      7180

fct_lump()

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump().

For example,

gss_cat %>%
  count(relig) # 15 levels
## # A tibble: 15 x 2
##    relig                       n
##    <fct>                   <int>
##  1 No answer                  93
##  2 Don't know                 15
##  3 Inter-nondenominational   109
##  4 Native american            23
##  5 Christian                 689
##  6 Orthodox-christian         95
##  7 Moslem/islam              104
##  8 Other eastern              32
##  9 Hinduism                   71
## 10 Buddhism                  147
## 11 Other                     224
## 12 None                     3523
## 13 Jewish                    388
## 14 Catholic                 5124
## 15 Protestant              10846
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig) # now only 2 levels
## # A tibble: 2 x 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Other      10637

The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’ve probably over-collapsed.

Instead, we can use the n parameter to specify how many groups (excluding other) we want to keep:

gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf) #don't need print() in this example
## # A tibble: 10 x 2
##    relig                       n
##    <fct>                   <int>
##  1 Protestant              10846
##  2 Catholic                 5124
##  3 None                     3523
##  4 Christian                 689
##  5 Other                     458
##  6 Jewish                    388
##  7 Buddhism                  147
##  8 Inter-nondenominational   109
##  9 Moslem/islam              104
## 10 Orthodox-christian         95
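
The effect of n is easier to follow on a small made-up factor with counts a = 10, b = 4, c = 2, d = 1 (f and lumped are illustrative names). Keeping the two most common levels should fold c and d into "Other":

```r
library(forcats)

f <- factor(rep(c("a", "b", "c", "d"), times = c(10, 4, 2, 1)))

# Keep the 2 most frequent levels, lump the rest into "Other"
lumped <- fct_lump(f, n = 2)
levels(lumped)
table(lumped)
```

Note that in the gss_cat output above, relig already has a level called "Other", so the lumped levels are merged into it rather than forming a separate group.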

How have party affiliations evolved over time?

levels(gss_cat$partyid)
##  [1] "No answer"          "Don't know"         "Other party"       
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"      
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"  
## [10] "Strong democrat"

We need to combine these.

gss_cat %>%
  mutate(partyid =
           fct_collapse(partyid,
                        other = c("No answer", "Don't know", "Other party"),
                        rep = c("Strong republican", "Not str republican"),
                        ind = c("Ind,near rep", "Independent", "Ind,near dem"),
                        dem = c("Not str democrat", "Strong democrat"))) %>%
  count(year, partyid)  %>%
  group_by(year) %>%
  mutate(p = n / sum(n)) %>%
  ggplot(aes(x = year, y = p, colour = fct_reorder2(partyid, year, p))) +
  geom_point() +
  geom_line() +
  labs(colour = "Party ID.")

Combine rincome categories to simplify it

Group all the non-responses into one category, and then group other categories into a smaller number. Since there is a clear ordering, we would not use fct_lump().

levels(gss_cat$rincome)
##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"
library("stringr")
gss_cat %>%
  mutate(rincome =
           fct_collapse(
             rincome,
             `Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"),
             `Lt $5000` = c("Lt $1000", 
                            str_c("$", c("1000", "3000", "4000"),
                                 " to ", c("2999", "3999", "4999"))),
             `$5000 to 10000` = str_c("$", c("5000", "6000", "7000", "8000"),
                                  " to ", c("5999", "6999", "7999", "9999"))
           )) %>%
  ggplot(aes(x = rincome)) +
  geom_bar() +
  coord_flip()