8  Data Frame Manipulation with dplyr

Manipulation of data frames means many things to many researchers, we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:

mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
[1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
[1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
[1] 7902.15

But this isn’t very nice because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

8.1 The dplyr package

Luckily, the dplyr package provides several very useful functions for manipulating data frames in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Tip: Tidyverse

dplyr package belongs to a broader family of opinionated R packages designed for data science called the “Tidyverse”. These packages are specifically designed to work harmoniously together. Some of these packages will be covered along this course, but you can find more complete information here: https://www.tidyverse.org/.

Here we’re going to cover 5 of the most commonly used functions as well as using pipes (|>) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

You will need to install (if you have not already) and load the tidyverse package, which will also install dplyr, ggplot2, and other related packages:

install.packages('tidyverse')
library(tidyverse)

Recall that you only need to install a package once (so you should not include the install.packages() code in your code chunk – you can just run it directly in the console), but you need to load it every time you start a new R session.

8.2 Extracting columns using select()

If, for example, we wanted to move forward with only a few of the variables in our data frame we could use the select() function. The first argument is the data frame and the remaining arguments are the names of the columns that you want to keep.

year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)
  year     country gdpPercap
1 1952 Afghanistan  779.4453
2 1957 Afghanistan  820.8530
3 1962 Afghanistan  853.1007
4 1967 Afghanistan  836.1971
5 1972 Afghanistan  739.9811
6 1977 Afghanistan  786.1134

Diagram illustrating use of select function to select two columns of a data frame ::: {.callout-tip} ## Quotes vs no quotes

Unlike the base R code, df["var"], you usually do not need to enclose the variable names in quotes in most tidyverse functions: you can write select(df, var) rather than select(df, "var"). :::

You can also use the select() function to rename columns, for example, the following code will rename the gdpPercap column to gdp_per_cap while selecting only the year, country, and gdp_per_cap columns:

gapminder_selected <- select(gapminder, year, country, gdp_per_cap = gdpPercap)
head(gapminder_selected)
  year     country gdp_per_cap
1 1952 Afghanistan    779.4453
2 1957 Afghanistan    820.8530
3 1962 Afghanistan    853.1007
4 1967 Afghanistan    836.1971
5 1972 Afghanistan    739.9811
6 1977 Afghanistan    786.1134

If we want to remove one column only from the gapminder data, we can use the -var syntax. For example, the code below removes the continent column:

smaller_gapminder_data <- select(gapminder, -continent)

Above we used ‘normal’ grammar, which places the data frame in the first argument of the dplyr function select(), but the real strength of dplyr lies in combining several functions using pipes.

8.3 The pipe |>

Since the pipe’s grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.

year_country_gdp <- gapminder |> select(year, country, gdpPercap)

To help you understand why we wrote that in that way, let’s walk through it step by step.

First, we summon the gapminder data frame and pass it on, using the pipe symbol |>, to the first argument in the function on the right-hand side of the pipe, which is the select() function.

So the two pieces of code are equivalent.

select(gapminder, year)
gapminder |> select(year)
Two pipes: |> vs %>%

You might sometimes see the pipe written as %>%, rather than |>.

%>% is the “old” version of the pipe that was first introduced in the magrittr R package (and is also loaded with the dplyr package).

Recently, a native pipe |> has been introduced to the base R language, and moving forward it is this new native pipe that we will use. The two pipes behave very similarly, with only minor differences (such as |> requires parentheses for functions with no arguments).

The pipe doesn’t only work with select(). The pipe can be used for every R function. For example, the following are equivalent:

head(gapminder)
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134
gapminder |> head()
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134

So are these two:

mean(c(1, 5, 3))
[1] 3
c(1, 5, 3) |> mean()
[1] 3
Challenge 1

Rewrite the following code using the pipe:

select(gapminder, -continent)
head(select(gapminder, life_exp = lifeExp, gdp_per_cap = gdpPercap))
gapminder |> select(-continent)
gapminder |>
  select(life_exp = lifeExp, gdp_per_cap = gdpPercap) |> 
  head()
Tip: Renaming data frame columns in dplyr

In an earlier lesson, we covered how you can rename columns with base R by assigning a value to the output of the names() function. While this works just fine, dplyr has a rename() function that makes this even easier.

Within a pipeline, the syntax is rename(new_name = old_name). For example, we may want to rename the gdpPercap column name from our select() statement above.

tidy_gdp <- year_country_gdp |> rename(gdp_per_capita = gdpPercap)
head(tidy_gdp)
  year     country gdp_per_capita
1 1952 Afghanistan       779.4453
2 1957 Afghanistan       820.8530
3 1962 Afghanistan       853.1007
4 1967 Afghanistan       836.1971
5 1972 Afghanistan       739.9811
6 1977 Afghanistan       786.1134
Tip: rename() vs select()

We saw above that we can use select() to rename columns, but this also requires us to specify which columns to include in the output.

gapminder |>
  select(gdp_per_capita = gdpPercap) |>
  head()
  gdp_per_capita
1       779.4453
2       820.8530
3       853.1007
4       836.1971
5       739.9811
6       786.1134

rename() however, will automatically return all columns, while renaming the specified column:

gapminder |>
  rename(gdp_per_capita = gdpPercap) |>
  head()
      country year      pop continent lifeExp gdp_per_capita
1 Afghanistan 1952  8425333      Asia  28.801       779.4453
2 Afghanistan 1957  9240934      Asia  30.332       820.8530
3 Afghanistan 1962 10267083      Asia  31.997       853.1007
4 Afghanistan 1967 11537966      Asia  34.020       836.1971
5 Afghanistan 1972 13079460      Asia  36.088       739.9811
6 Afghanistan 1977 14880372      Asia  38.438       786.1134

8.4 Filter to rows using filter()

If we only want to look at the European countries in gapminder, we can use the filter() function:

gapminder |> filter(continent == "Europe") |> head()
  country year     pop continent lifeExp gdpPercap
1 Albania 1952 1282697    Europe   55.23  1601.056
2 Albania 1957 1476505    Europe   59.28  1942.284
3 Albania 1962 1728137    Europe   64.82  2312.889
4 Albania 1967 1984060    Europe   66.22  2760.197
5 Albania 1972 2263554    Europe   67.69  3313.422
6 Albania 1977 2509048    Europe   68.93  3533.004

We can even chain together select() and filter():

year_country_gdp_euro <- gapminder |>
  filter(continent == "Europe") |>
  select(year, country, gdpPercap)
head(year_country_gdp_euro)
  year country gdpPercap
1 1952 Albania  1601.056
2 1957 Albania  1942.284
3 1962 Albania  2312.889
4 1967 Albania  2760.197
5 1972 Albania  3313.422
6 1977 Albania  3533.004

You can supply multiple conditions to filter() using a comma. This is equivalent to “and” &, as only rows where both conditions are satisfied will be returned.

For example, if we now want to show the life expectancy of European countries but only for a specific year (e.g., 2007), we can do as below.

europe_lifeExp_2007 <- gapminder |>
  filter(continent == "Europe", year == 2007) |>
  select(country, lifeExp)
Challenge 2

Use piping to produce a data frame that contains values for lifeExp, country and year for Africa but not for other Continents.

Your data frame should have 624 rows and 3 columns (you can check this with dim()).

year_country_lifeExp_Africa <- gapminder |>
  filter(continent == "Africa") |>
  select(year, country, lifeExp)
head(year_country_lifeExp_Africa)
  year country lifeExp
1 1952 Algeria  43.077
2 1957 Algeria  45.685
3 1962 Algeria  48.303
4 1967 Algeria  51.407
5 1972 Algeria  54.518
6 1977 Algeria  58.014
dim(year_country_lifeExp_Africa)
[1] 624   3

In the challenge above, first, we pass the gapminder data frame to the filter() function, then we pass the filtered version of the gapminder data frame to the select() function. Note: The order of operations is very important in this case. If we used select() first, filter() would not be able to find the variable continent since we would have removed it in the previous step.

8.5 Adding and modifying columns using mutate()

You can add new columns as well as modify existing columns in a data frame using mutate().

For example, the following code will create a new column called gdp_billion which contains the product of gdpPercap and pop columns and divides by 1 billion.

gapminder |>
  mutate(gdp_billion = gdpPercap * pop / 10^9) |>
  head()
      country year      pop continent lifeExp gdpPercap gdp_billion
1 Afghanistan 1952  8425333      Asia  28.801  779.4453    6.567086
2 Afghanistan 1957  9240934      Asia  30.332  820.8530    7.585449
3 Afghanistan 1962 10267083      Asia  31.997  853.1007    8.758856
4 Afghanistan 1967 11537966      Asia  34.020  836.1971    9.648014
5 Afghanistan 1972 13079460      Asia  36.088  739.9811    9.678553
6 Afghanistan 1977 14880372      Asia  38.438  786.1134   11.697659

You don’t have to always create a new variable with mutate()– you can instead modify an existing variable. For example, the code below replaces the lifeExp column with the rounded version of lifeExp:

gapminder |>
  mutate(lifeExp = round(lifeExp)) |>
  head()
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia      29  779.4453
2 Afghanistan 1957  9240934      Asia      30  820.8530
3 Afghanistan 1962 10267083      Asia      32  853.1007
4 Afghanistan 1967 11537966      Asia      34  836.1971
5 Afghanistan 1972 13079460      Asia      36  739.9811
6 Afghanistan 1977 14880372      Asia      38  786.1134
Combining mutate() and select() with transmute()

Sometimes you want to both mutate and select at the same time. Rather than doing this in two separate steps, you can do it in one using transmute().

The following two code chunks are equivalent:

gapminder |>
  mutate(gdp_billion = gdpPercap * pop / 10^9) |>
  select(country, year, gdp_billion) |>
  head()
      country year gdp_billion
1 Afghanistan 1952    6.567086
2 Afghanistan 1957    7.585449
3 Afghanistan 1962    8.758856
4 Afghanistan 1967    9.648014
5 Afghanistan 1972    9.678553
6 Afghanistan 1977   11.697659
gapminder |>
  transmute(country, year, gdp_billion = gdpPercap * pop / 10^9) |>
  head()
      country year gdp_billion
1 Afghanistan 1952    6.567086
2 Afghanistan 1957    7.585449
3 Afghanistan 1962    8.758856
4 Afghanistan 1967    9.648014
5 Afghanistan 1972    9.678553
6 Afghanistan 1977   11.697659
Challenge 3

Explain what the following code is trying to do and explain why it doesn’t work. Re-write the code so that it does work:

gapminder |>
  filter(country == "Australia") |>
  select(year, lifeExp = round(lifeExp))
Error in `select()`:
! Problem while evaluating `round(lifeExp)`.
Caused by error:
! object 'lifeExp' not found

The code doesn’t work because you can’t make modifications to the data inside a select() function.

You could fix this in a few ways. The most efficient is using transmute():

gapminder |>
  filter(country == "Australia") |>
  transmute(year, lifeExp = round(lifeExp))
   year lifeExp
1  1952      69
2  1957      70
3  1962      71
4  1967      71
5  1972      72
6  1977      73
7  1982      75
8  1987      76
9  1992      78
10 1997      79
11 2002      80
12 2007      81

But you could also use mutate() and select() separately

gapminder |>
  filter(country == "Australia") |>
  select(year, lifeExp) |>
  mutate(lifeExp = round(lifeExp)) 
   year lifeExp
1  1952      69
2  1957      70
3  1962      71
4  1967      71
5  1972      72
6  1977      73
7  1982      75
8  1987      76
9  1992      78
10 1997      79
11 2002      80
12 2007      81

8.6 Creating summaries using summarize()

You can easily compute a “summary” of a column using summarize(). For example, the following code computes the max life expectancy in the data:

gapminder |>
  summarize(max_life_exp = max(lifeExp))
  max_life_exp
1       82.603

You can even compute multiple summaries at once, e.g., computing both the max and minimum of the lifeExp variable:

gapminder |>
  summarize(max_life_exp = max(lifeExp),
            min_life_exp = min(lifeExp))
  max_life_exp min_life_exp
1       82.603       23.599

8.7 Grouped computations with group_by()

Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now, we haven’t done that since we would have to repeat the above for each continent.

One way to compute the mean lifeExp for each continent using dplyr functions would be:

gapminder |> filter(continent == "Europe") |> summarise(mean(lifeExp))
  mean(lifeExp)
1      71.90369
gapminder |> filter(continent == "Africa") |> summarise(mean(lifeExp))
  mean(lifeExp)
1      48.86533
gapminder |> filter(continent == "Asia") |> summarise(mean(lifeExp))
  mean(lifeExp)
1       60.0649
gapminder |> filter(continent == "Americas") |> summarise(mean(lifeExp))
  mean(lifeExp)
1      64.65874
gapminder |> filter(continent == "Oceania") |> summarise(mean(lifeExp))
  mean(lifeExp)
1      74.32621

But this is very repetitive and is not in line with the tidyverse way of doing things!

The group_by() function offers a nice alternative approach for grouped operations on a data frame:

gapminder |>
  group_by(continent) |>
  summarize(mean(lifeExp)) 
# A tibble: 5 × 2
  continent `mean(lifeExp)`
  <chr>               <dbl>
1 Africa               48.9
2 Americas             64.7
3 Asia                 60.1
4 Europe               71.9
5 Oceania              74.3

The group_by() function essentially splits the data frame into separate data frames to perform the computation, and then joins the result back into a single data frame.

Diagram illustrating how the group by function oraganizes a data frame into groups

Moreover, group_by() doesn’t work just with summarize. Can you figure out what the following code is doing?

gapminder |>
  group_by(continent) |>
  mutate(mean_lifeExp = mean(lifeExp)) |>
  ungroup()
# A tibble: 1,704 × 7
   country      year      pop continent lifeExp gdpPercap mean_lifeExp
   <chr>       <int>    <dbl> <chr>       <dbl>     <dbl>        <dbl>
 1 Afghanistan  1952  8425333 Asia         28.8      779.         60.1
 2 Afghanistan  1957  9240934 Asia         30.3      821.         60.1
 3 Afghanistan  1962 10267083 Asia         32.0      853.         60.1
 4 Afghanistan  1967 11537966 Asia         34.0      836.         60.1
 5 Afghanistan  1972 13079460 Asia         36.1      740.         60.1
 6 Afghanistan  1977 14880372 Asia         38.4      786.         60.1
 7 Afghanistan  1982 12881816 Asia         39.9      978.         60.1
 8 Afghanistan  1987 13867957 Asia         40.8      852.         60.1
 9 Afghanistan  1992 16317921 Asia         41.7      649.         60.1
10 Afghanistan  1997 22227415 Asia         41.8      635.         60.1
# … with 1,694 more rows
Tip: ungrouping

Note that it is good practice to follow any grouped operation with ungroup() since this prevents further downstream operations from also being grouped.

Tip: tibbles

You might have noticed that the object above is a “tibble”. A tibble is the tidyverse’s version of a data frame. For our purposes, tibbles and data frames are completely equivalent. It is only once you get into very advanced tidyverse programming will you start to see the features that tibbles possess that data frames don’t, such as list columns.

8.7.1 Grouping by multiple variables

The function group_by() allows us to group by multiple variables. For example, let’s group by year and continent and compute the mean gdpPercap for each year-continent combination:

gapminder |>
  group_by(continent, year) |>
  summarize(mean_gdpPercap = mean(gdpPercap)) |> 
  ungroup() |>
  head()
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
  continent  year mean_gdpPercap
  <chr>     <int>          <dbl>
1 Africa     1952          1253.
2 Africa     1957          1385.
3 Africa     1962          1598.
4 Africa     1967          2050.
5 Africa     1972          2340.
6 Africa     1977          2586.
Challenge 4

Use group_by(), summarise(), and filter() to identify which country has the longest average life expectancy and which has the shortest average life expectancy.

One way to do this is:

gapminder |>
  group_by(country) |>
  summarize(mean_lifeExp = mean(lifeExp)) |>
  ungroup() |>
  filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
# A tibble: 2 × 2
  country      mean_lifeExp
  <chr>               <dbl>
1 Iceland              76.5
2 Sierra Leone         36.8

8.8 Ordering the rows in a data frame using arrange()

Another way to complete the previous challenge is to use the dplyr function arrange(), which arranges the rows in a data frame according to the order of one or more variables from the data frame. It has a similar syntax to other functions from the dplyr package. You can use desc() inside arrange() to sort in descending order.

First, we can compute the average life expectancy for each country

lifeExp_bycountry <- gapminder |>
  group_by(country) |>
  summarise(mean_lifeExp = mean(lifeExp)) |>
  ungroup()

Then we can arrange the data frame in order of the average life expectancy for each country to view the countries with the lowest average life expectancy

lifeExp_bycountry |>
  arrange(mean_lifeExp) |>
  head()
# A tibble: 6 × 2
  country       mean_lifeExp
  <chr>                <dbl>
1 Sierra Leone          36.8
2 Afghanistan           37.5
3 Angola                37.9
4 Guinea-Bissau         39.2
5 Mozambique            40.4
6 Somalia               41.0

To view the countries with the highest life expectancy, we can use the desc() function inside arrange() (or we can view the tail() of the previous data frame):

lifeExp_bycountry |>
  arrange(desc(mean_lifeExp)) |>
  head()
# A tibble: 6 × 2
  country     mean_lifeExp
  <chr>              <dbl>
1 Iceland             76.5
2 Sweden              76.2
3 Norway              75.8
4 Netherlands         75.6
5 Switzerland         75.6
6 Canada              74.9
Arranging by a character variable

Applying arrange() to a character variable will present the rows in alphabetical order.

8.9 count() and n()

A very common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.

For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use the count() function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort=TRUE:

gapminder |>
  filter(year == 2002) |>
  count(continent, sort = TRUE)
  continent  n
1    Africa 52
2      Asia 33
3    Europe 30
4  Americas 25
5   Oceania  2

If we need to use the number of observations/rows in calculations, the n() function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column.

For instance, if we wanted to get the standard error of the life expectancy per continent:

gapminder |>
  group_by(continent) |>
  summarize(se_le = sd(lifeExp) / sqrt(n()))
# A tibble: 5 × 2
  continent se_le
  <chr>     <dbl>
1 Africa    0.366
2 Americas  0.540
3 Asia      0.596
4 Europe    0.286
5 Oceania   0.775
Advanced Challenge

Calculate the total GDP in billions (gdpPercap * pop / 1e9) for each continent in the year 2002 and 2007. Which continent had the largest GDP in 2002? which had the smallest? What about in 2007?

One way to answer this question is:

gapminder |>
  filter(year %in% c(2002, 2007)) |>
  mutate(gdp = gdpPercap * pop / 1e9) |>
  group_by(year, continent) |>
  summarize(total_gdp = sum(gdp))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 10 × 3
# Groups:   year [2]
    year continent total_gdp
   <int> <chr>         <dbl>
 1  2002 Africa        1836.
 2  2002 Americas     16531.
 3  2002 Asia         15115.
 4  2002 Europe       13093.
 5  2002 Oceania        690.
 6  2007 Africa        2380.
 7  2007 Americas     19418.
 8  2007 Asia         20708.
 9  2007 Europe       14795.
10  2007 Oceania        807.

8.10 Other great resources