13  Reshaping data frames with tidyr

Researchers often want to reshape their data frames from ‘wide’ to ‘longer’ layouts, or vice-versa.

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column for the observed variable and the other columns are ID variables.

For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time or observations of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format.

However, many of R‘s functions have been designed assuming you have a ’tidy’ formatted dataset, in which

The “wide”-labeled format in the example below is actually tidy:

Diagram illustrating the difference between a tidy versus long layout of a data frame

13.1 Getting started

The tools we will be using in this lesson come from the “tidyr” package, which is part of the tidyverse suite:

Load the tidyverse package

library(tidyverse)

First, let’s look at the structure of our original gapminder data frame:

head(gapminder)
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134
Challenge 1

Is gapminder in a purely long format, a purely wide format, or is it in a tidy format?

The original gapminder data frame is in a tidy format.

A long format of gapminder would instead look like this:

       country year continent  variable        value
1  Afghanistan 1952      Asia       pop 8.425333e+06
2  Afghanistan 1952      Asia   lifeExp 2.880100e+01
3  Afghanistan 1952      Asia gdpPercap 7.794453e+02
4  Afghanistan 1957      Asia       pop 9.240934e+06
5  Afghanistan 1957      Asia   lifeExp 3.033200e+01
6  Afghanistan 1957      Asia gdpPercap 8.208530e+02
7  Afghanistan 1962      Asia       pop 1.026708e+07
8  Afghanistan 1962      Asia   lifeExp 3.199700e+01
9  Afghanistan 1962      Asia gdpPercap 8.531007e+02
10 Afghanistan 1967      Asia       pop 1.153797e+07

Can you identify the difference?

Note: Some ggplots are easier to create using long-format data, but generally you want your data to be in a tidy format.

13.2 Pivoting longer with pivot_longer()

To convert the original gapminder data to the longer format, we will use the pivot_longer() function.

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns, or ‘lengthening’ your observation variables into a single variable.

Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format

gapminder_long <- gapminder |>
  pivot_longer(cols = c(pop, lifeExp, gdpPercap),
               names_to = "variable", 
               values_to = "value")
gapminder_long
# A tibble: 5,112 × 5
   country      year continent variable       value
   <chr>       <int> <chr>     <chr>          <dbl>
 1 Afghanistan  1952 Asia      pop        8425333  
 2 Afghanistan  1952 Asia      lifeExp         28.8
 3 Afghanistan  1952 Asia      gdpPercap      779. 
 4 Afghanistan  1957 Asia      pop        9240934  
 5 Afghanistan  1957 Asia      lifeExp         30.3
 6 Afghanistan  1957 Asia      gdpPercap      821. 
 7 Afghanistan  1962 Asia      pop       10267083  
 8 Afghanistan  1962 Asia      lifeExp         32.0
 9 Afghanistan  1962 Asia      gdpPercap      853. 
10 Afghanistan  1967 Asia      pop       11537966  
# … with 5,102 more rows

We first provide to pivot_longer() a vector of column names that will be used to pivot into a longer format.

Note: pivot_longer() also allows the alternative syntax of using the - symbol to identify which variables are not to be pivoted (i.e. ID variables).

The next arguments to pivot_longer() are names_to for naming the column that will contain the variable names (variable) and values_to for naming the column that will contain the values of each variable. We supply these new column names as strings.

Typically, long-format is useful when you want to plot multiple variables on the same plot, e.g., a line plot of each of the three variables (pop, lifeExp, and gdpPercap) over time.

gapminder_long |>
  # filter to just the data for the US
  filter(country == "United States") |>
  # scale each variable's value to a common scale between 0 (min) and 1
  group_by(variable) |>
  mutate(value_scaled = value / sd(value)) |>
  ungroup() |>
  # plot a line plot of each variable over time
  ggplot() +
  geom_line(aes(x = year, y = value_scaled, color = variable))

Challenge 2

Using gapminder_long, calculate the mean life expectancy, population, and gdpPercap for each continent.

*Hint:** use the group_by() and summarize() functions we learned in the dplyr lesson

gapminder_long |> 
  group_by(continent, variable) |>
  summarize(means = mean(value))
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
# A tibble: 15 × 3
# Groups:   continent [5]
   continent variable       means
   <chr>     <chr>          <dbl>
 1 Africa    gdpPercap     2194. 
 2 Africa    lifeExp         48.9
 3 Africa    pop        9916003. 
 4 Americas  gdpPercap     7136. 
 5 Americas  lifeExp         64.7
 6 Americas  pop       24504795. 
 7 Asia      gdpPercap     7902. 
 8 Asia      lifeExp         60.1
 9 Asia      pop       77038722. 
10 Europe    gdpPercap    14469. 
11 Europe    lifeExp         71.9
12 Europe    pop       17169765. 
13 Oceania   gdpPercap    18622. 
14 Oceania   lifeExp         74.3
15 Oceania   pop        8874672. 

13.3 Pivoting wider with pivot_wider()

Let’s convert our long format data back to the (wider) tidy format using pivot_wider().

pivot_wider() is the opposite of pivot_longer(), making a dataset wider by increasing the number of columns and decreasing the number of rows.

The pivot_wider() function takes names_from and values_from arguments.

To names_from we supply the column name whose contents will be pivoted into new output columns in the widened data frame. The corresponding values will be added from the column named in the values_from argument.

gapminder_wider <- gapminder_long |>
  pivot_wider(names_from = variable, values_from = value)
head(gapminder_wider)
# A tibble: 6 × 6
  country      year continent      pop lifeExp gdpPercap
  <chr>       <int> <chr>        <dbl>   <dbl>     <dbl>
1 Afghanistan  1952 Asia       8425333    28.8      779.
2 Afghanistan  1957 Asia       9240934    30.3      821.
3 Afghanistan  1962 Asia      10267083    32.0      853.
4 Afghanistan  1967 Asia      11537966    34.0      836.
5 Afghanistan  1972 Asia      13079460    36.1      740.
6 Afghanistan  1977 Asia      14880372    38.4      786.
dim(gapminder_wider)
[1] 1704    6

Which is the same as our original gapminder data (but with some of the columns in a different order)

head(gapminder)
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134
dim(gapminder)
[1] 1704    6
Challenge 3

Is the mtcars dataset in a tidy format?

mtcars_data <- read.csv("data/mtcars.csv") 

Convert mtcars to a longer format using pivot_longer() (Hint: the car variable is an ID variable and should not be pivoted).

Use pivot_wider() to convert your result back to the original mtcars dataset.

mtcars_long <- mtcars_data |>
  pivot_longer(-car)
mtcars_long
# A tibble: 352 × 3
   car       name   value
   <chr>     <chr>  <dbl>
 1 Mazda RX4 mpg    21   
 2 Mazda RX4 cyl     6   
 3 Mazda RX4 disp  160   
 4 Mazda RX4 hp    110   
 5 Mazda RX4 drat    3.9 
 6 Mazda RX4 wt      2.62
 7 Mazda RX4 qsec   16.5 
 8 Mazda RX4 vs      0   
 9 Mazda RX4 am      1   
10 Mazda RX4 gear    4   
# … with 342 more rows
mtcars_wide <- mtcars_long |>
  pivot_wider(names_from = "name", values_from = "value")
mtcars_wide
# A tibble: 32 × 12
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows

13.4 Other great resources