2 Project Management With RStudio
2.1 Introduction
The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually, everything is a bit mixed together.
Managing your projects in a reproducible fashion doesn’t just make your science reproducible, it makes your life easier.
— Vince Buffalo (@vsbuffalo) April 15, 2013
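Most people tend to organize their projects as a single folder in which original data, modified data, scripts, figures, and manuscript drafts all sit side by side.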
There are many reasons why we should avoid this:
It is really hard to tell which version of your data is the original and which is the modified;
It gets messy because it mixes files with various extensions together;
It probably takes you a lot of time to find things, and to match each figure to the exact code that was used to generate it.
A good project layout will ultimately make your life easier:
It will help ensure the integrity of your data;
It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
It allows you to easily upload your code with your manuscript submission;
It makes it easier to pick the project back up after a break.
2.2 A possible solution
Fortunately, there are tools and packages which can help you manage your work effectively.
One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using this today to create a self-contained, reproducible project.
The simplest way to open an RStudio project once it has been created is to click through your file system to the directory where it was saved and double-click on the .Rproj file. This will open RStudio and start your R session in the same directory as the .Rproj file. Paths to your data, plots, and scripts can now be written relative to the project directory. RStudio projects also let you open multiple projects at the same time, each in its own project directory, so that they do not interfere with each other.
2.3 Best practices for project organization
One of the more effective ways to work with R is to write the code you want to run directly in the code chunks of a Quarto document (or in an .R script), and then run the selected lines in the interactive R console (either using the keyboard shortcuts in RStudio or by clicking the “Run” button).
However, it is important to save all of the code that led to your final results, e.g., in Quarto documents and R scripts, rather than leaving it only in the console history.
Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:
2.3.1 Treat data as read-only
This is probably the most important goal of setting up a project. Data are typically time-consuming and/or expensive to collect. Working with them interactively (e.g., in Excel), where they can be modified, means you are never sure where the data came from or how they have been modified since collection. It is, therefore, a good idea to treat your data as “read-only”.
You should keep an unmodified (read-only) copy of the raw data in a data/ folder.
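As a minimal sketch of what “read-only” can look like in practice, the following removes write permission from a raw data file using base R. The project location, file name, and file contents are hypothetical, and a temporary directory is used so that nothing in a real project is touched. How strictly the permission is enforced depends on the operating system and on the user running R.

```r
# Hypothetical project root; a temporary directory keeps the sketch self-contained.
project_dir <- file.path(tempdir(), "my_project")
data_dir <- file.path(project_dir, "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw data file (the file name and contents are made up).
raw_file <- file.path(data_dir, "measurements_raw.csv")
if (!file.exists(raw_file)) {
  write.csv(data.frame(id = 1:3, value = c(2.5, NA, 4.1)),
            raw_file, row.names = FALSE)
}

# Remove write permission so the file is harder to overwrite by accident.
Sys.chmod(raw_file, mode = "0444")

# Reading still works as usual.
raw <- read.csv(raw_file)

# file.access() returns 0 when access is allowed and -1 otherwise.
# For write access (mode = 2) this is typically -1 now, although
# privileged users may still be allowed to write.
write_status <- file.access(raw_file, mode = 2)
```

Any cleaned or derived version of the data would then be written to a different file, leaving the raw file untouched.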
2.3.2 Data Cleaning
In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging” or “data cleaning”. Writing a data cleaning function and saving it in a standalone R script can help with reproducibility, because the same cleaning steps are then applied every time the raw data is loaded.
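A data cleaning function might look something like the sketch below. The function name `clean_measurements` and the column names `ID` and `Value` are assumptions made for illustration; the cleaning steps a real dataset needs will differ.

```r
# Hypothetical cleaning function: takes the raw data frame and returns a
# cleaned copy, leaving the input unchanged.
clean_measurements <- function(raw) {
  cleaned <- raw
  names(cleaned) <- tolower(names(cleaned))                  # consistent lower-case names
  cleaned <- cleaned[!is.na(cleaned$value), , drop = FALSE]  # drop rows with missing values
  rownames(cleaned) <- NULL                                  # reset row numbering
  cleaned
}

# Made-up raw data standing in for a file read from the data/ folder.
raw <- data.frame(ID = 1:4, Value = c(2.5, NA, 4.1, 3.3))
cleaned <- clean_measurements(raw)
```

Because the function returns a new data frame rather than overwriting anything, the raw data stays as it was collected.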
2.3.3 Writing functions
When you have multiple Quarto analysis files, you often want to run the same code (e.g., to load and clean the data) in each file. Rather than repeating your code, it is a good idea to save reusable code as functions in separate R scripts that can be stored in a functions/ folder.
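One way to make such functions available in every analysis file is to `source()` the scripts in the functions/ folder at the top of each file. The sketch below assumes a hypothetical project in a temporary directory and writes a small made-up script, `standardize.R`, so that the example is self-contained; in a real project the script would be written by hand.

```r
# Hypothetical project root; a temporary directory keeps the sketch self-contained.
project_dir <- file.path(tempdir(), "my_project")
functions_dir <- file.path(project_dir, "functions")
dir.create(functions_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a function script that would normally be written by hand.
writeLines(
  c("standardize <- function(x) {",
    "  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)",
    "}"),
  file.path(functions_dir, "standardize.R")
)

# At the top of each analysis file: load every function script in the folder.
function_files <- list.files(functions_dir, pattern = "\\.R$", full.names = TRUE)
for (f in function_files) {
  source(f)
}

# The function is now available in the session.
z <- standardize(c(1, 2, 3))
```

Inside an RStudio project the folder could be referred to simply as "functions", since paths are relative to the project directory.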
2.3.4 Treat generated output as disposable
Anything generated by your scripts should be treated as disposable: you should be able to delete it and regenerate all of it by re-running your scripts or re-compiling your analysis files.
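A simple way to check that output really is disposable is to delete it and rebuild it from code. The sketch below assumes generated files are kept in a hypothetical output/ folder (a name chosen here for illustration) and uses a made-up summary table in place of real results.

```r
# Hypothetical project root; a temporary directory keeps the sketch self-contained.
project_dir <- file.path(tempdir(), "my_project")
output_dir <- file.path(project_dir, "output")

# Delete everything that was generated previously ...
unlink(output_dir, recursive = TRUE)
dir.create(output_dir, recursive = TRUE)

# ... and regenerate it from code (a made-up table stands in for real results).
results <- data.frame(group = c("a", "b"), mean_value = c(1.5, 2.5))
write.csv(results, file.path(output_dir, "summary.csv"), row.names = FALSE)

regenerated <- list.files(output_dir)
```

Keeping generated files in their own folder, separate from the raw data, makes it safe to delete them in one step.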
2.3.5 Working directory
Knowing R’s current working directory is important because when you need to access other files (for example, to import a data file), R will look for them relative to the current working directory.
Each time you create a new RStudio Project, it will create a new directory for that project. When you open an existing .Rproj file, it will open that project and set R’s working directory to the folder that the file is in.
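The following sketch shows how relative paths relate to the working directory. The file name `measurements_raw.csv` is hypothetical and the file does not need to exist for the example to run.

```r
# The working directory R is currently using. Inside an RStudio project
# this is the folder that contains the .Rproj file.
wd <- getwd()

# A relative path is interpreted relative to the working directory ...
relative_path <- file.path("data", "measurements_raw.csv")

# ... so it refers to the same location as this absolute path.
absolute_path <- file.path(wd, relative_path)

# file.exists() reports whether R can find the file from where it is now.
found <- file.exists(relative_path)
```

Writing paths relative to the project directory, rather than as absolute paths, means the same code keeps working when the project folder is moved or shared.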