4 Vectors and types

One of R’s most powerful features is its ability to deal with tabular data - such as you may already have in a spreadsheet or a CSV file.

Let’s start by using R to create a dataset, which we will then save in our data/ directory in a file called feline-data.csv. First, let’s create the dataset in R using the data.frame() function:

cats <- data.frame(coat = c("calico", "black", "tabby"),
                    weight = c(2.1, 5.0, 3.2),
                    likes_string = c(1, 0, 1))

Then we can save cats as a CSV file. It is good practice to call the argument names explicitly so the function knows what default values you are changing. Here we are setting row.names = FALSE. Recall you can use ?write.csv to pull up the help file to check out the argument names and their default values.

write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)

You should now see that you have a new file, feline-data.csv, in your data/ folder, whose contents look like this:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

Tip: Editing Text files in R

Alternatively, you can create data/feline-data.csv using a text editor (Nano), or within RStudio with the File -> New File -> Text File menu item.

We can then load this .csv file into R via the following:

cats <- read.csv(file = "data/feline-data.csv")
cats

    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

Tip: read.table() and delimiters

The read.table function is used for reading tabular data stored in a text file where the columns of data are separated by punctuation characters such as .csv files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in csv files. For convenience R provides 2 other versions of read.table. These are: read.csv for files where the data are separated with commas and read.delim for files where the data are separated with tabs. Of these three functions read.csv is the most commonly used. If needed it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.

There are lots of things we can do with our cats data object, such as extracting individual columns using the $ operator:

cats$weight

[1] 2.1 5.0 3.2

cats$coat

[1] "calico" "black"  "tabby"

Each column is a vector.

There are lots of operations that we can do on vectors, such as:

## Say we discovered that the scale weighs two Kg light:
cats$weight + 2

[1] 4.1 7.0 5.2

paste("My cat is", cats$coat)

[1] "My cat is calico" "My cat is black"  "My cat is tabby"

But what about:

cats$weight + cats$coat

Error in cats$weight + cats$coat: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing data in R.

4.1 Data Types

If you guessed that the last command will return an error because 2.1 plus "black" is nonsense, you’re right - and you already have some intuition for an important concept in programming called data types. We can ask what type or “class” or “type” of data something is:

class(cats$weight)

[1] "numeric"

You will typically encounter the following main types: numeric (which encompasses double and integer), logical, and character (and factor, but we won’t encounter these until later). There are others too (such as complex), but you’re unlikely to encounter them in your data analysis journeys.

Let’s identify the class of several values:

class(3.14)

[1] "numeric"

class(TRUE)

[1] "logical"

class("banana")

[1] "character"

No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types. This strictness has some really important consequences.

A user has added details of another cat. This information is in the file data/feline-data_v2.csv.

file.show("data/feline-data_v2.csv")

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1

Load the new cats data like before, and check what type of data we find in the weight column:

cats_v2 <- read.csv(file="data/feline-data_v2.csv")
class(cats_v2$weight)

[1] "character"

Oh no, our weights aren’t the numeric class anymore! If we try to do the same math we did on them before, we run into trouble:

cats_v2$weight + 2

Error in cats_v2$weight + 2: non-numeric argument to binary operator

What happened?

The cats data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R.

A given column in a data frame can only contain one single data type (but each column can be of a different type).

In this case, R does not read everything in the data frame column weight as numeric (specifically, R reads the entry 2.3 or 2.4 as a character), therefore the entire column data type changes to something that is suitable for everything in the column.

When R reads a csv file, it reads it in as a data frame. Thus, when we loaded the cats csv file, it is stored as a data frame. We can recognize data frames by the first row that is written by the str() function:

str(cats)

'data.frame':   3 obs. of  3 variables:
 $ coat        : chr  "calico" "black" "tabby"
 $ weight      : num  2.1 5 3.2
 $ likes_string: int  1 0 1

Data frames are composed of rows and columns, where each column is a vector of the same length. Different columns in a data frame can be made up of different data types (this is what makes them so versatile), but everything in a given column needs to be the same type (e.g., numeric, character, logical, etc).

Let’s explore more about different data structures and how they behave. For now, let’s go back to working with the original feline-data.csv file while we investigate this behavior further:

feline-data.csv:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

cats <- read.csv(file = "data/feline-data.csv")
cats

    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

4.2 Vectors and Type Coercion

To better understand this behavior, let’s learn more about the vector. A vector in R is essentially an ordered collection of values, with the special condition that everything in the vector must be the same basic data type.

A vector can be created with the c() “combine” function:

c(1, 8, 1.2)

[1] 1.0 8.0 1.2

The columns of a data frame are also vectors:

cats$weight

[1] 2.1 5.0 3.2

The fact that everything in a vector must be the same type is the root of why R forces everything in a column to be the same basic data type.

4.2.1 Coercion by combining vectors

Because all entries in a vector must have the same type, c() will coerce the type of each element to a common type. Given what we’ve learned so far, what do you think the following will produce?

quiz_vector <- c(2, 6, '3')

This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them. When R encounters a mix of types (here numeric and character) to be combined into a single vector, it will force them all to be the same type. Consider:

coercion_vector <- c('a', TRUE)
coercion_vector

[1] "a"    "TRUE"

another_coercion_vector <- c(0, TRUE)
another_coercion_vector

[1] 0 1

4.2.2 The type hierarchy

The coercion rules go: logical -> numeric -> character, where -> can be read as “are transformed into”. For example, combining logical and character transforms the result to character:

c('a', TRUE)

[1] "a"    "TRUE"

Tip

A quick way to recognize character vectors is by the quotes that enclose them when they are printed.

You can try to force coercion against this flow using the as. functions:

character_vector_example <- c('0', '2', '4')
character_vector_example

[1] "0" "2" "4"

character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric

[1] 0 2 4

numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical

[1] FALSE  TRUE  TRUE

As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data.frames, or you will get nasty surprises!

But coercion can also be very useful! For example, in our cats data likes_string is numeric, but we know that the 1s and 0s actually represent TRUE and FALSE (a common way of representing them). We should use the logical datatype here, which has two states: TRUE or FALSE, which is exactly what our data represents. We can ‘coerce’ this column to be logical by using the as.logical function:

cats$likes_string

[1] 1 0 1

cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string

[1]  TRUE FALSE  TRUE

Challenge 1

An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format, (e.g. numbers), your analysis is much easier! In this exercise, you will clean the cat data set from the chapter about type coercion.

4.2.3 Copy the code template

In your quarto file in RStudio, start a new code chunk and copy and paste the following code. Then move on to the tasks below, which will help you to fill in the gaps (______).

# Read data
cats <- read.csv("data/feline-data_v2.csv")
# 1. Print the data
_____

# 2. Show an overview of the table that prints out the type of each column
_____(cats)

# 3. The "weight" column has the incorrect data type __________.
#    The correct data type is: ____________.

# 4. Correct the 4th weight data point with the mean of the two given values
cats$weight[4] <- 2.35
#    print the data again to see the effect
cats

# 5. Convert the weight to the right data type
cats$weight <- ______________(cats$weight)

#    Calculate the mean to test yourself
mean(cats$weight)

# If you see the correct mean value (and not NA), you did the exercise
# correctly!

4.3 Instructions for the tasks

4.3.1 1. Print the data

Execute the first statement (read.csv(...)). Then print the data to the console

Tip 1.1

Print the contents of any variable by typing its name.

Solution to Challenge 1.1

Two correct solutions:

cats

    coat weight likes_string
1 calico    2.1         TRUE
2  black    5.0        FALSE
3  tabby    3.2         TRUE

print(cats)

    coat weight likes_string
1 calico    2.1         TRUE
2  black    5.0        FALSE
3  tabby    3.2         TRUE

4.3.2 2. Overview of the data types

Use a function we saw earlier to print out the “type” of all columns of the cats table.

Tip 1.2

In the chapter “Data types” we saw two functions that can show data types. One printed just a single word, the data type name. The other printed a short form of the data type and the first few values. We recommend the second here.

Solution to Challenge 1.2

str(cats)

'data.frame':   3 obs. of  3 variables:
 $ coat        : chr  "calico" "black" "tabby"
 $ weight      : num  2.1 5 3.2
 $ likes_string: logi  TRUE FALSE TRUE

4.3.3 3. Which data type do we need?

The shown data type is not the right one for this data (weight of a cat). Which data type do we need?

Why did the read.csv() function not choose the correct data type?
Fill in the gap in the comment with the correct data type for cat weight!

Tip 1.3

Scroll up to the section about the type hierarchy to review the available data types

Solution to Challenge 1.3

Weight is expressed on a continuous scale (real numbers). The R data type for this is “numeric”.
The fourth row has the value “2.3 or 2.4”. That is not a number but two, and an English word. Therefore, the “character” data type is chosen. The whole column is now text because all values in the same columns have to be the same data type.

4.3.4 4. Correct the problematic value

The code to assign a new weight value to the problematic fourth row is given. Think first and then execute it: What will be the data type after assigning a number like in this example? You can check the data type after executing to see if you were right.

Tip 1.4

Revisit the hierarchy of data types when two different data types are combined.

Solution to challenge 1.4

The data type of the column “weight” is “character”. The assigned data type is “numeric”. Combining two data types yields the data type that is higher in the following hierarchy:

logical < numeric < character

Therefore, the column is still of type character! We need to manually convert it to “numeric”.

4.3.5 5. Convert the column “weight” to the correct data type

Cat weights are numbers. But the column does not have this data type yet. Coerce the column to floating point numbers.

Tip 1.5

The functions to convert data types start with as.. You can look for the function further up in the manuscript or use the RStudio auto-complete function: Type “as.” and then press the TAB key.

Solution to Challenge 1.5

cats$weight <- as.numeric(cats$weight)

4.4 Some basic functions for creating vectors

The combine function, c(), can also be used both to create a new vector as well as to append things to an existing vector:

ab_vector <- c('a', 'b')
ab_vector

[1] "a" "b"

combine_example <- c(ab_vector, 'z')
combine_example

[1] "a" "b" "z"

You can also make a series of numbers using the : syntax as well as the seq() function:

mySeries <- 1:10
mySeries

 [1]  1  2  3  4  5  6  7  8  9 10

seq(10)

 [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 10, by = 0.1)

 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0

The head() and tail() functions show the first and last few entries of a vector, respectively.

sequence_example <- 20:25
head(sequence_example, n = 2)

[1] 20 21

tail(sequence_example, n = 4)

[1] 22 23 24 25

The length() function computes the number of entries in the vector:

length(sequence_example)

[1] 6

And the class() function reports the class/type of the values in the vector:

class(sequence_example)

[1] "integer"

4.5 Factors

Let’s consider a new data type: the factor.

For an object containing the data type factor, each different value represents what is called a level, and is often how categorical variables/columns whose values can have a finite set of options will be formatted.

Can you identify any categorical variables in the cats data frame? What about the coat variables?

cats$coat

[1] "calico" "black"  "tabby"

The coat variable is currently formatted as a character variable

class(cats$coat)

[1] "character"

but we can convert it to a factor using the as.factor() function:

cats$coat <- as.factor(cats$coat)
class(cats$coat)

[1] "factor"

Let’s take a look at the factor-formatted coat column:

cats$coat

[1] calico black  tabby 
Levels: black calico tabby

It looks very similar to the character format, but now our output tells us that there are the following “levels”: “black”, “calico”, “tabby”

One common pitfall occurs when converting numerically coded factors to a numeric type.

If we convert a coat to a numeric type, it replaces each level with a number in the order that the levels are defined (the default is alphabetical order):

as.numeric(cats$coat)

[1] 2 1 3

But if the factor levels are themselves numbers,

factor_weight <- as.factor(cats$weight)
factor_weight

[1] 2.1 5   3.2
Levels: 2.1 3.2 5

and we convert this numeric factor to a numeric type, the numeric information will be lost:

as.numeric(factor_weight)

[1] 1 3 2

Fortunately, factor and character types behave fairly similarly across most applications, so it usually won’t matter which format your categorical variables are encoded, but it is important to be aware of factors as you will undoubtedly encounter them in your R journey.