mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
[1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
[1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
[1] 7902.15
Manipulation of data frames means many things to many researchers, we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
[1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
[1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
[1] 7902.15
But this isn’t very nice because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.
dplyr
packageLuckily, the dplyr
package provides several very useful functions for manipulating data frames in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr
grammar easier to read.
Here we’re going to cover 5 of the most commonly used functions as well as using pipes (|>
) to combine them.
select()
filter()
group_by()
summarize()
mutate()
You will need to install (if you have not already) and load the tidyverse package, which will also install dplyr, ggplot2, and other related packages:
install.packages('tidyverse')
library(tidyverse)
Recall that you only need to install a package once (so you should not include the install.packages()
code in your code chunk – you can just run it directly in the console), but you need to load it every time you start a new R session.
If, for example, we wanted to move forward with only a few of the variables in our data frame we could use the select()
function. The first argument is the data frame and the remaining arguments are the names of the columns that you want to keep.
<- select(gapminder, year, country, gdpPercap)
year_country_gdp head(year_country_gdp)
year country gdpPercap
1 1952 Afghanistan 779.4453
2 1957 Afghanistan 820.8530
3 1962 Afghanistan 853.1007
4 1967 Afghanistan 836.1971
5 1972 Afghanistan 739.9811
6 1977 Afghanistan 786.1134
::: {.callout-tip} ## Quotes vs no quotes
Unlike the base R code, df["var"]
, you usually do not need to enclose the variable names in quotes in most tidyverse functions: you can write select(df, var)
rather than select(df, "var")
. :::
You can also use the select()
function to rename columns, for example, the following code will rename the gdpPercap
column to gdp_per_cap
while selecting only the year
, country
, and gdp_per_cap
columns:
<- select(gapminder, year, country, gdp_per_cap = gdpPercap)
gapminder_selected head(gapminder_selected)
year country gdp_per_cap
1 1952 Afghanistan 779.4453
2 1957 Afghanistan 820.8530
3 1962 Afghanistan 853.1007
4 1967 Afghanistan 836.1971
5 1972 Afghanistan 739.9811
6 1977 Afghanistan 786.1134
If we want to remove one column only from the gapminder
data, we can use the -var
syntax. For example, the code below removes the continent
column:
<- select(gapminder, -continent) smaller_gapminder_data
Above we used ‘normal’ grammar, which places the data frame in the first argument of the dplyr function select()
, but the real strength of dplyr
lies in combining several functions using pipes.
|>
Since the pipe’s grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.
<- gapminder |> select(year, country, gdpPercap) year_country_gdp
To help you understand why we wrote that in that way, let’s walk through it step by step.
First, we summon the gapminder data frame and pass it on, using the pipe symbol |>
, to the first argument in the function on the right-hand side of the pipe, which is the select()
function.
So the two pieces of code are equivalent.
select(gapminder, year)
|> select(year) gapminder
The pipe doesn’t only work with select()
. The pipe can be used for every R function. For example, the following are equivalent:
head(gapminder)
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
|> head() gapminder
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
So are these two:
mean(c(1, 5, 3))
[1] 3
c(1, 5, 3) |> mean()
[1] 3
If we only want to look at the European countries in gapminder, we can use the filter()
function:
|> filter(continent == "Europe") |> head() gapminder
country year pop continent lifeExp gdpPercap
1 Albania 1952 1282697 Europe 55.23 1601.056
2 Albania 1957 1476505 Europe 59.28 1942.284
3 Albania 1962 1728137 Europe 64.82 2312.889
4 Albania 1967 1984060 Europe 66.22 2760.197
5 Albania 1972 2263554 Europe 67.69 3313.422
6 Albania 1977 2509048 Europe 68.93 3533.004
We can even chain together select()
and filter()
:
<- gapminder |>
year_country_gdp_euro filter(continent == "Europe") |>
select(year, country, gdpPercap)
head(year_country_gdp_euro)
year country gdpPercap
1 1952 Albania 1601.056
2 1957 Albania 1942.284
3 1962 Albania 2312.889
4 1967 Albania 2760.197
5 1972 Albania 3313.422
6 1977 Albania 3533.004
You can supply multiple conditions to filter()
using a comma. This is equivalent to “and” &
, as only rows where both conditions are satisfied will be returned.
For example, if we now want to show the life expectancy of European countries but only for a specific year (e.g., 2007), we can do as below.
<- gapminder |>
europe_lifeExp_2007 filter(continent == "Europe", year == 2007) |>
select(country, lifeExp)
In the challenge above, first, we pass the gapminder data frame to the filter()
function, then we pass the filtered version of the gapminder data frame to the select()
function. Note: The order of operations is very important in this case. If we used select()
first, filter()
would not be able to find the variable continent
since we would have removed it in the previous step.
You can add new columns as well as modify existing columns in a data frame using mutate()
.
For example, the following code will create a new column called gdp_billion
which contains the product of gdpPercap
and pop
columns and divides by 1 billion.
|>
gapminder mutate(gdp_billion = gdpPercap * pop / 10^9) |>
head()
country year pop continent lifeExp gdpPercap gdp_billion
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 6.567086
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7.585449
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 8.758856
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 9.648014
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 9.678553
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 11.697659
You don’t have to always create a new variable with mutate()
– you can instead modify an existing variable. For example, the code below replaces the lifeExp
column with the rounded version of lifeExp
:
|>
gapminder mutate(lifeExp = round(lifeExp)) |>
head()
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 29 779.4453
2 Afghanistan 1957 9240934 Asia 30 820.8530
3 Afghanistan 1962 10267083 Asia 32 853.1007
4 Afghanistan 1967 11537966 Asia 34 836.1971
5 Afghanistan 1972 13079460 Asia 36 739.9811
6 Afghanistan 1977 14880372 Asia 38 786.1134
You can easily compute a “summary” of a column using summarize()
. For example, the following code computes the max life expectancy in the data:
|>
gapminder summarize(max_life_exp = max(lifeExp))
max_life_exp
1 82.603
You can even compute multiple summaries at once, e.g., computing both the max and minimum of the lifeExp variable:
|>
gapminder summarize(max_life_exp = max(lifeExp),
min_life_exp = min(lifeExp))
max_life_exp min_life_exp
1 82.603 23.599
Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now, we haven’t done that since we would have to repeat the above for each continent.
One way to compute the mean lifeExp for each continent using dplyr functions would be:
|> filter(continent == "Europe") |> summarise(mean(lifeExp)) gapminder
mean(lifeExp)
1 71.90369
|> filter(continent == "Africa") |> summarise(mean(lifeExp)) gapminder
mean(lifeExp)
1 48.86533
|> filter(continent == "Asia") |> summarise(mean(lifeExp)) gapminder
mean(lifeExp)
1 60.0649
|> filter(continent == "Americas") |> summarise(mean(lifeExp)) gapminder
mean(lifeExp)
1 64.65874
|> filter(continent == "Oceania") |> summarise(mean(lifeExp)) gapminder
mean(lifeExp)
1 74.32621
But this is very repetitive and is not in line with the tidyverse way of doing things!
The group_by()
function offers a nice alternative approach for grouped operations on a data frame:
|>
gapminder group_by(continent) |>
summarize(mean(lifeExp))
# A tibble: 5 × 2
continent `mean(lifeExp)`
<chr> <dbl>
1 Africa 48.9
2 Americas 64.7
3 Asia 60.1
4 Europe 71.9
5 Oceania 74.3
The group_by()
function essentially splits the data frame into separate data frames to perform the computation, and then joins the result back into a single data frame.
Moreover, group_by()
doesn’t work just with summarize. Can you figure out what the following code is doing?
|>
gapminder group_by(continent) |>
mutate(mean_lifeExp = mean(lifeExp)) |>
ungroup()
# A tibble: 1,704 × 7
country year pop continent lifeExp gdpPercap mean_lifeExp
<chr> <int> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.8 779. 60.1
2 Afghanistan 1957 9240934 Asia 30.3 821. 60.1
3 Afghanistan 1962 10267083 Asia 32.0 853. 60.1
4 Afghanistan 1967 11537966 Asia 34.0 836. 60.1
5 Afghanistan 1972 13079460 Asia 36.1 740. 60.1
6 Afghanistan 1977 14880372 Asia 38.4 786. 60.1
7 Afghanistan 1982 12881816 Asia 39.9 978. 60.1
8 Afghanistan 1987 13867957 Asia 40.8 852. 60.1
9 Afghanistan 1992 16317921 Asia 41.7 649. 60.1
10 Afghanistan 1997 22227415 Asia 41.8 635. 60.1
# … with 1,694 more rows
The function group_by()
allows us to group by multiple variables. For example, let’s group by year
and continent
and compute the mean gdpPercap for each year-continent combination:
|>
gapminder group_by(continent, year) |>
summarize(mean_gdpPercap = mean(gdpPercap)) |>
ungroup() |>
head()
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
continent year mean_gdpPercap
<chr> <int> <dbl>
1 Africa 1952 1253.
2 Africa 1957 1385.
3 Africa 1962 1598.
4 Africa 1967 2050.
5 Africa 1972 2340.
6 Africa 1977 2586.
Another way to complete the previous challenge is to use the dplyr function arrange()
, which arranges the rows in a data frame according to the order of one or more variables from the data frame. It has a similar syntax to other functions from the dplyr
package. You can use desc()
inside arrange()
to sort in descending order.
First, we can compute the average life expectancy for each country
<- gapminder |>
lifeExp_bycountry group_by(country) |>
summarise(mean_lifeExp = mean(lifeExp)) |>
ungroup()
Then we can arrange the data frame in order of the average life expectancy for each country to view the countries with the lowest average life expectancy
|>
lifeExp_bycountry arrange(mean_lifeExp) |>
head()
# A tibble: 6 × 2
country mean_lifeExp
<chr> <dbl>
1 Sierra Leone 36.8
2 Afghanistan 37.5
3 Angola 37.9
4 Guinea-Bissau 39.2
5 Mozambique 40.4
6 Somalia 41.0
To view the countries with the highest life expectancy, we can use the desc()
function inside arrange()
(or we can view the tail()
of the previous data frame):
|>
lifeExp_bycountry arrange(desc(mean_lifeExp)) |>
head()
# A tibble: 6 × 2
country mean_lifeExp
<chr> <dbl>
1 Iceland 76.5
2 Sweden 76.2
3 Norway 75.8
4 Netherlands 75.6
5 Switzerland 75.6
6 Canada 74.9
A very common operation is to count the number of observations for each group. The dplyr
package comes with two related functions that help with this.
For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use the count()
function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort=TRUE
:
|>
gapminder filter(year == 2002) |>
count(continent, sort = TRUE)
continent n
1 Africa 52
2 Asia 33
3 Europe 30
4 Americas 25
5 Oceania 2
If we need to use the number of observations/rows in calculations, the n()
function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column.
For instance, if we wanted to get the standard error of the life expectancy per continent:
|>
gapminder group_by(continent) |>
summarize(se_le = sd(lifeExp) / sqrt(n()))
# A tibble: 5 × 2
continent se_le
<chr> <dbl>
1 Africa 0.366
2 Americas 0.540
3 Asia 0.596
4 Europe 0.286
5 Oceania 0.775