R is a statistical programming language widely used in the data science field.
R provides a record of your analysis
R uses scripts to record which steps you take during your analysis. Although this might initially take longer to learn than pointing and clicking your way through some other software, it makes it much easier to go back to your analysis in the future and re-do or adjust it, if needed. An added benefit is that it enables you to make comments as to why you’ve done certain steps in your analysis.
Reproducibility
It’s the latest buzzword in science, and for good reason! Using R to record every step of your analysis, and linking this to the original raw data that you used, allows people to reproduce your work. One of those people may very well be your future self!
This kind of record also enables other people to decide for themselves if the choices you’ve made in the analysis are reasonable and allows them (and yourself) to spot any mistakes.
For those of you who work in academia there is an added advantage. An increasing number of journals and funding bodies now require you to deposit your raw data and the analysis that led to your conclusions. Knowing how to use R can really be a benefit here.
Even news agencies, such as the BBC Shared Data Unit, now make the underlying analysis of many of their news items available, and many of these analyses are performed in R. With the rise of “fake news”, more and more people expect transparency!
When opening RStudio you’re greeted with the following screen (note that this may look different on your computer, depending on your operating system and colour scheme):
There are three panels visible at this stage, clockwise:
A bit more about the console. Although it can be useful to execute code directly (in the case of one-off actions, such as the installation of packages) you will want to save your code most of the time so you can share it or refer back to it later. In that case it’s good to start a script to write the code in.
To start a new script, go to File -> New File -> R Script. Your new screen will look something like this, with a new panel appearing in the top left corner:
You might have noticed that the bottom right panel contains five different tabs, one of which is called Files. In our example it shows the current working directory. The concept of a working directory is a very important one. When we’re working on a project, it is good practice to keep all the data, scripts, images and any other files in a single folder. We call this single folder the working directory, and it allows us to use relative paths in scripts as opposed to absolute paths.
Relative path: data/raw/gapminder.csv
Absolute path: C:/Users/Admin/Documents/R/Projects/2020_example/data/raw/gapminder.csv
The beauty of using relative paths is that you can move the working directory folder to any location you like and all the references and links in the scripts will continue to work. Hard-coding absolute paths is a bad idea because it ties your code to a specific directory on your computer, which makes it much harder to share.
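As a rough illustration of the difference, the sketch below builds the relative path shown above with file.path() and reads a file through it. The temporary project folder and the two-row stand-in file are assumptions made purely so the example runs anywhere; they are not the real gapminder data.

```r
# Create a throwaway "project" folder so this sketch runs on any computer
project_dir <- file.path(tempdir(), "2020_example")
dir.create(file.path(project_dir, "data", "raw"),
           recursive = TRUE, showWarnings = FALSE)

# Write a tiny stand-in file (made-up rows, NOT the real gapminder data)
toy_data <- data.frame(country = c("Austria", "Belgium"),
                       year = c(2007, 2007))
write.csv(toy_data,
          file.path(project_dir, "data", "raw", "gapminder.csv"),
          row.names = FALSE)

# Make the project folder the working directory, remembering the old one
old_wd <- setwd(project_dir)

# The relative path contains no reference to where the project lives
relative_path <- file.path("data", "raw", "gapminder.csv")
gapminder_toy <- read.csv(relative_path)

# R resolves the relative path against the working directory
resolved_path <- normalizePath(relative_path)

# Restore the previous working directory
setwd(old_wd)
```

If the whole 2020_example folder were moved elsewhere, only the working directory would change; the relative_path line in the script would stay exactly the same.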
Apart from using a working directory, it also pays off to think about how to organise your files within it. In this example the working directory is called 2020_example and it contains several other directories to organise our files:
The data folder contains a raw and a processed sub-folder. This is often a good idea, because you want to make sure you never overwrite your original, raw data. Any analyses can always be repeated (save your script!) and their output can go into the processed data folder.
The documents folder contains documents that are relevant to the project.
The figures folder can contain sub-folders to separate quick exploratory plots and publication-quality figures.
The notebooks folder can contain your R Markdown files in which you explore your data in detail.
The scripts folder contains all the scripts used in the final analyses. It often helps to order them and name them consistently, such as 01-read_data.R, 02-clean_data.R, 03-exploratory_analysis.R etc. It’ll save you a lot of headache when trying to find things again in the future!
The working directory also contains a file named 2020_example.Rproj. This file is created by RStudio, and when you open it, RStudio starts a new session and sets the working directory to where this Rproj file is located. Useful! To create your own R project, go to File -> New Project.
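The folder layout described above can also be created from R itself. The following is a minimal sketch that assumes a temporary location, so it does not touch your real files; in practice you would run something similar from inside your own project folder.

```r
# Temporary stand-in for the project folder (an assumption for this sketch)
project_dir <- file.path(tempdir(), "2020_example")

# Sub-folders taken from the layout described above
sub_dirs <- c("data/raw", "data/processed", "documents",
              "figures", "notebooks", "scripts")

for (d in sub_dirs) {
  # recursive = TRUE also creates any missing parent folders
  dir.create(file.path(project_dir, d),
             recursive = TRUE, showWarnings = FALSE)
}

# Create empty, consistently numbered script files
script_names <- c("01-read_data.R",
                  "02-clean_data.R",
                  "03-exploratory_analysis.R")
file.create(file.path(project_dir, "scripts", script_names))

# list.files() sorts alphabetically, so the numbered prefixes
# list the scripts in the order they are meant to be run
list.files(file.path(project_dir, "scripts"))
```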
Note: You can create new directories by clicking on the New Folder button or by creating them directly in your file browser.
It is good practice to avoid spaces in your directory names because they can create issues when used in code. Instead, you can use an underscore.
R is also case-sensitive, so be consistent in your use of uppercase and lowercase. I personally use lowercase exclusively because it’s one less thing to think about!
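To show what this naming advice could look like in practice, here is a small sketch. The helper make_safe_name() is hypothetical (it is not part of R or RStudio) and simply lowercases a name and swaps spaces for underscores.

```r
# Hypothetical helper: turn a free-form name into a lowercase,
# space-free name that is easier to use in code
make_safe_name <- function(x) {
  x <- tolower(trimws(x))        # drop surrounding spaces, use lowercase
  gsub("[[:space:]]+", "_", x)   # replace each run of spaces with one underscore
}

safe_names <- make_safe_name(c("Raw Data", "Exploratory  Plots "))
safe_names
# "raw_data" "exploratory_plots"

# R treats names that differ only in case as different
"Data" == "data"
# FALSE
```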
Making your research reproducible also involves thinking about where to store your data. We will look more closely at GitHub later, but for now suffice it to say that you can use GitHub to easily share your analyses, which can also include the data itself.
Below is an example of a data set that is available on the GitHub repository for this course.
This works well if your data sets are not too large, but depending on your research you might soon find it becomes limiting. There are online repositories for all kinds of data, such as GenBank for genetic sequences and ENA for various kinds of sequencing information.
These repositories keep your data available and provide a fixed, unique URL to your data that you can use within your scripts.
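A script can read data straight from such a fixed address. The sketch below is only an approximation: to stay runnable without internet access it uses a local file addressed with the file:// scheme as a stand-in, where a real script would use the https:// link provided by the repository. The file contents are made up for illustration.

```r
# Stand-in for a repository: a small local file with made-up content
local_copy <- tempfile(fileext = ".csv")
write.csv(data.frame(country = "Austria", year = 2007),
          local_copy, row.names = FALSE)

# Build a file:// address for it; in a real script this would be
# the fixed URL of your data set in the repository
data_url <- paste0("file://", normalizePath(local_copy, winslash = "/"))

# read.csv() accepts a URL in place of a file name
raw_data <- read.csv(data_url)
raw_data
```

Because the address is written in the script, anyone running it reads exactly the same data without having to copy files around by hand.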
When publishing your research, journals will usually require a link to all the data. If you’re a researcher at Cambridge University you can use Apollo. This is the University’s online repository and takes all kinds of data, enabling you to keep all your data in one place.
There are plenty more repositories to choose from and the main things to keep in mind are accessibility and stability. Can people access your data freely, and are the data likely to be in the same place in 10 years’ time?
Some of you might have completed the Introduction to R for Biologists course at the University. If so, you will have come across the concept of a data exploration workflow.
When you are working on a project that requires data analysis, you will normally need to perform the following steps:
More information on this workflow can be found in the R for Data Science book. Let’s briefly go over each step:
In the gapminder data set the country column specifies the country, using a capital letter at the start. A common problem in data sets is inconsistent capitalisation. For example, here Austria would be different to austria, since R is case-sensitive. During the tidying stage it is important to make the data set as consistent as possible so that you can focus on the questions you are trying to answer in your analysis.
All these stages in the data exploration workflow can be achieved by programming in R. In this course we will focus on the communicate aspect of the workflow. And for that we need to turn to R Markdown.
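To make the capitalisation point from the tidying step concrete, here is a minimal sketch using a made-up vector of country names (not the real gapminder data). The helper standardise_name() is hypothetical and only capitalises the first letter, so multi-word names such as "united kingdom" would need a more careful approach.

```r
# Made-up example values with inconsistent capitalisation
countries <- c("Austria", "austria", "AUSTRIA", "Belgium")

# R compares text exactly, so these are different values
"Austria" == "austria"
# FALSE

# Hypothetical helper: lowercase everything, then capitalise the first letter
standardise_name <- function(x) {
  x <- tolower(trimws(x))
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}

tidy_countries <- standardise_name(countries)
unique(tidy_countries)
# "Austria" "Belgium"
```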