Introduction to R

Purpose of this workshop

• Learn about why we use R
• Be able to start an R project
• Learn about data structures (especially .csv files)
• Be able to share data
• Be able to read, run, and share R code

Why R?

R is one of the most widely used platforms in research. R is the name of both the programming language and the software in which it runs. There are many reasons to learn and use R.

• The software is free
• The algorithms used in base R are well documented and trusted
• Any additional packages published on CRAN are also well documented, so that algorithms can be verified
• The process of working in R relies on writing code. Using code rather than dropdown menus means that you have a record of the work that you’ve done (to re-visit later for edits or to remind yourself of what you did). You can also include comments in your code to remind yourself or explain to others what you have done. Code can be easily emailed or shared among collaborators.
• Extensive help is available for R. Help is included in base R and in packages. Numerous tutorials are available online. Several user groups exist online that allow people to ask and answer questions. Many solutions to problems can be gleaned from the pages documenting these user group questions.
• R is highly extensible through packages created by R users and published online. These packages are usually highly specific, providing tools fit for purpose. Most packages are well documented, including their methods and algorithms, and people can thus verify their accuracy and place their trust in those packages. However, this is not a given.
• R is excellent for producing highly customisable figures that can be saved in a variety of formats.
• One can conduct almost all of one’s project workflow entirely in R, from data entry and management, to analysis, and even to producing presentations, posters, manuscripts, and books.
• However … as with any technological tool the usual principle of GIGO (Garbage In, Garbage Out) applies. Thus, the onus is on you to ensure that your questions are well constructed, your hypotheses well formulated, your study planned, your data collection faithfully conducted, and your analysis and interpretation rigorously applied.

General notes

There are many ways to skin a cat, and this is especially true for conducting analyses in R. Writing code is part art and part science and, as a result, there are often many different ways to approach the same question and arrive at the same outcome. You will no doubt find better, simpler, and more efficient ways to perform the tasks presented here.

Due to the open-source and collaborative nature of working in R, any given project usually draws on the prior work of a multitude of people who go unacknowledged. We learn R by finding solutions here and there on user forums, or on blogs, or through tutorials, and those uncredited lines of code become part of our R canon. Where we can, we should cite the software and the packages in our work, and acknowledge significant contributions from specific people or webpages. Almost all of what you see below has at some point been the brainchild of someone else to be co-opted and adapted by myself. I hope you will do the same and co-opt and adapt what you find useful here.

The rest of this workshop relies on you having already installed R and RStudio and it relies on you having an active internet connection.

Starting an R project

Note that this section applies mainly to Windows users. A good way to start an R project is to create a new folder for the project on your PC. Right-click in the folder in Windows Explorer to get the function menu where you’ll create a new text file. Give the text file an appropriate name and change the file extension from “.txt” to “.R”. You may have to unhide the file extension by going to ‘view’ and then finding the option to unhide the extension. This file is now your R script in which you’ll write your code. This R script can be emailed, uploaded, shared etc. as needed.

I usually make sure that I associate R file types with RStudio in Windows (i.e. files with extensions of “.R” and “.Rmd”). If this has already been done, then you can double click on the script file you just created, and it should open in RStudio. If you have not associated the .R file type with RStudio, then you need to right click on the file and ‘Open with…’

Before we continue in RStudio, we’ll go back to Windows Explorer to make some points about data.

Handling data

Many folks new to R may be used to handling data in a spreadsheet application like MS Excel. There, data are typically mixed with comments, notes, figures, output, functions, colours, shading, italics, bold, and any number of additional means of ‘marking up’ the data. However, it is difficult to use these spreadsheets directly in R or other quantitative applications.

A long-standing method of dealing with data is to use relational databases (https://en.wikipedia.org/wiki/Relational_database) with normalised data (https://en.wikipedia.org/wiki/Database_normalization). I won’t go into either of these in detail here (they both require workshops of their own), but it could help you to become familiar with both. Briefly, relational databases consist of sets of data that link together using unique identifying codes. If those sets of data are reduced to their normal forms (first, second, and third), then we end up with highly efficient databases where information is not repeated unnecessarily.

Each of these sets of data or forms should be stored in a flat file that contains no extraneous markup. The standard approach is to use a text file, which can easily be shared, can be read by most software applications, and is generally lightweight (minimal size for the amount of data it contains). One type of text file is the delimited file, such as a ‘comma separated values’ file, or ‘.csv’ file, which, as the name suggests, uses the comma to delimit or separate fields in the file. This is what I recommend since it is easy to create, easily shared, and easily read by others.

A few words of caution though …

1. If your data set contains fields with text (such as a comments or notes field) then you need to make sure that there are no inherent commas in those notes that could create problems when delimiting the fields.
2. Some regional settings in Windows can assign a comma as the decimal separator in numbers, which could also hamper the field separation. If this is the case, it seems as though the semi-colon is the default separator or delimiter. These regional settings can be changed through the Windows control panel, but we also have some work-around solutions in R (see below).

Once you have the data in the correct format, you can save a single worksheet as a .csv file. Try to keep the file names simple too (i.e. avoid spaces in the name – use an underscore if a space is needed).

Some important things to note:

• Make the field headings simple, without spaces, without uppercase lettering, or any non-alpha-numeric characters.
• The field headings should be short, but descriptive enough that someone else can interpret what is contained in your fields.
• The data fields should not mix data types, where possible … i.e. some cells with numbers and some with letters or characters.
• In general, R is rather naive at interpreting data types contained in data sets, and you always need to look at the structure of the dataset that R has interpreted when you import your .csv file (more on that, later).

Preliminaries in RStudio

RStudio layout

When you open RStudio directly, you’ll probably see three ‘windows’ by default. If you double-clicked on the .R textfile you created, you’ll see four ‘windows’. The placement of the windows can be changed in the RStudio settings, but in general, you’ll have your scripts in the upper-left quadrant, your console in the lower-left quadrant, your environment (where objects are stored) in the upper-right quadrant, and then several stacked windows for plots, packages, help, and a few other things in the lower-right quadrant.

You will do most of your work in the script. You’ll see output from the code you’ve run in the console and in the plot window. You can find help in the help window, and you can install or manage packages in the packages window.

In the script, it is helpful to start out with some metadata or comments about your current project. To create comments in your script, you need to precede text with a hash sign (#). Anything after the hash sign on the same line will not be run as code and will act as a comment to remind you or your collaborators what you are doing.
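As a minimal sketch, a script header could look like the lines below. The project name, author, and date are placeholders rather than details from this workshop, and the object `total` is only there to show a comment at the end of a line of code.

```r
# Project: my_first_r_project (placeholder name)
# Author:  Your Name (placeholder)
# Date:    YYYY-MM-DD (placeholder)
# Purpose: import a .csv file and explore the data

total <- 1 + 1  # text after the hash sign on a line of code is also ignored
```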

Running code

Now that you are ready to work in your script, you will need to run lines or sections of code. You can do this by placing your cursor on a line of code, or highlighting/selecting a line of code, or highlighting a section of code, and then clicking on the ‘Run’ button at the top-right of your script window. However, I find it easier to hit CTRL-ENTER or CTRL-r once I’ve made the cursor placement or selection, rather than pointing and clicking.

Once you run a line of code, you’ll see a response in the console window. The code that has been run will appear in blue. Messages that might require your attention (for example, error messages) will appear in red, and messages that purely provide information will appear in black.

Workspace and working directory

The first thing to do is to clear the workspace or environment of any objects that might remain from a previous analysis.

Then you need to tell R where the working directory is. This is where your data files (.csv) should be stored and where R will save output (i.e. if you save plots or figures or if you save results or new data as .csv files).

Note that you need to use double backslashes or single forward slashes in this line of code.
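The two steps above can be sketched as follows. The folder paths are hypothetical and are left commented out so that the example runs on any machine; replace them with the path to your own project folder. The object `old_result` is made up to stand in for leftovers from earlier work.

```r
old_result <- 42     # a leftover object from earlier work (made up)
rm(list = ls())      # remove every object from the workspace
ls()                 # lists the objects that remain

# Hypothetical project folder; either form works on Windows:
#   setwd("C:/Users/me/Documents/my_project")       # single forward slashes
#   setwd("C:\\Users\\me\\Documents\\my_project")   # double backslashes

getwd()              # confirm which folder R is currently using
```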

Packages

Base R is what you have when you install R and follow all the default installation settings. Packages are created by R users and are published on CRAN (the Comprehensive R Archive Network, found at https://cran.r-project.org/) for others to download and use. These packages extend the functionality of base R and are often required to do non-standard analysis.

If you want to use a package, you need to have it installed on your system. You only need to do this once, unless you update R, in which case you’ll need to re-install the packages you need. You should probably update R at least once a year.

You can download the packages in a format called ‘binaries’ from CRAN (they get saved as zip files), and then install them at a later time, even if you are offline, but the easiest method is to do this in real time while connected to the internet. You have two methods in RStudio to do this.

The first method is the more general method, and this works in R on its own – i.e. even if you are not working in RStudio. It is code based. The only potential problem with this method is that you need to be certain of the package name (including spelling and use of uppercase). As an example, we’ll use a package called ‘foreign’, which is a package that allows you to import non-standard data files (including files generated from other statistics software).
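A minimal sketch of the code-based method is shown here. The basic call is `install.packages("foreign")`; wrapping it in a `requireNamespace()` check is an addition of this sketch so that the package is only downloaded when it is missing.

```r
# The package name is case sensitive and must be given in quotes
if (!requireNamespace("foreign", quietly = TRUE)) {
  install.packages("foreign")   # downloads from CRAN; needs an internet connection
}
```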

The second method is RStudio specific. You click on the Packages window in the lower-right quadrant. This shows you which packages are already installed. If the one you want is not there, then click on “Install”. Now you can install from a previously downloaded package zip file, or from the internet via CRAN. In the input box, start typing the name of the package you want, and after a few seconds RStudio will give you auto-complete options. Click on the correct package and then “Install”. The auto-complete function means that you don’t have to worry about uppercase or spelling too much (apart from the first few characters of the package name).

After installing a package, you still need to attach it for your current R session.

Remember – you need to install a package once – you don’t want to keep downloading it for every new R session. And, you need to attach a package for every session you need that package.
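Attaching a package is done with `library()`. This sketch assumes the ‘foreign’ package from the example above is already installed.

```r
library(foreign)                   # attach the package for this session

# Check that the package is now on the search path
"package:foreign" %in% search()
```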

R Syntax

We won’t go into detail on the R syntax today – this is something you can look at in online tutorials or workbooks in more detail. There are also several handy cheatsheets available, both for base R and for packages that allow you to extend R. You can find such cheatsheets with a quick web search or go here https://www.rstudio.com/resources/cheatsheets/ or here https://cran.r-project.org/doc/contrib/Short-refcard.pdf or here https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf.

We’ll just cover a few basics here.

Firstly, you need to know about the assignment operator “<-” which assigns whatever occurs on the right of the operator to whatever object you have on the left. Thus, if you want to make a variable called “x” and assign it a value of 10, you can do so with the assignment operator.

By calling the object you’ve created, you can see what it is.

It is also helpful to see what data type you are dealing with. You can look at the class of the data type.

And look at the structure of the object, which is more helpful when you have larger and more complex objects.
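The steps above can be sketched in four lines:

```r
x <- 10      # assign the value 10 to an object called x
x            # call the object to print its value
class(x)     # "numeric"
str(x)       # compact summary of the structure of x
```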

Import data

As discussed above, you’ll usually have your data in .csv files, which you can import into R and then work with.
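A minimal sketch of an import with `read.csv()` follows. So that it runs on its own, it first writes a small made-up file to a temporary folder; the file name `test_data.csv` and the fields `x`, `y`, and `group` are assumptions for illustration. In practice you would point `read.csv()` at a file in your working directory.

```r
# Create a small made-up .csv file (comma-separated, decimal points)
csv_path <- file.path(tempdir(), "test_data.csv")
writeLines(c("x,y,group",
             "1.5,10.2,a",
             "2.5,12.8,a",
             "3.5,15.1,b",
             "4.5,18.9,b"), csv_path)

# Import the file; header = TRUE uses the first row as field names
test <- read.csv(csv_path, header = TRUE)
test
```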

If the .csv file you are using was created with different regional settings (see above), then instead of decimals denoted by ‘.’ and fields separated by ‘,’ you may have decimals denoted by ‘,’ and fields separated by ‘;’.

To check what is going on in the .csv, you can go to your open folder in Windows Explorer and right-click on the file in question, and then ‘open with …’ and select notepad. Now you should be able to see what separator is used. If the separator is ‘;’ and decimals are denoted by ‘,’ then adjust the code above to this:
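A sketch of the adjusted import, using the `sep` and `dec` arguments of `read.csv()`. The file is made up and written to a temporary folder so that the example is self-contained.

```r
# A made-up file with ';' as field separator and ',' as decimal mark
csv_path <- file.path(tempdir(), "test_data_semicolon.csv")
writeLines(c("x;y;group",
             "1,5;10,2;a",
             "2,5;12,8;b"), csv_path)

# Tell R which separator and decimal mark the file uses
test <- read.csv(csv_path, header = TRUE, sep = ";", dec = ",")
str(test)   # x and y should now be numeric

# read.csv2() is a shortcut with sep = ";" and dec = "," as defaults
test2 <- read.csv2(csv_path)
```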

Here is a comparison of what the output will look like when you get the separator incorrect and then correct:
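One way to see the difference is to read the same file twice. In this sketch a made-up comma-separated file is read first with the wrong separator and then with the correct one; the file name and contents are assumptions.

```r
csv_path <- file.path(tempdir(), "test_data_compare.csv")
writeLines(c("x,y,group",
             "1.5,10.2,a",
             "2.5,12.8,b"), csv_path)

# Wrong separator: each line is read as a single field
wrong <- read.csv(csv_path, sep = ";")
ncol(wrong)    # 1
names(wrong)

# Correct separator: three separate fields
right <- read.csv(csv_path, sep = ",")
ncol(right)    # 3
names(right)
```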

Now you want to look at the object:

First look at the class

Then the structure. Here, note the data types in each field.

Then look at the first 5 rows of the object

And finally the last 5 rows.
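The four checks above can be sketched as follows, using a small made-up stand-in for the `test` object with the fields `x`, `y`, and `group`.

```r
# Made-up stand-in for the imported data
test <- data.frame(x = 1:10,
                   y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.3),
                   group = rep(c("a", "b"), each = 5))

class(test)     # "data.frame"
str(test)       # data type of each field
head(test, 5)   # first 5 rows
tail(test, 5)   # last 5 rows
```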

Extract parts of the object

For different object or data classes, you will use different operators. The most common, when you are working with data frame objects is to use the dollar sign to select a particular field. Here we’ll look at “y” in the “test” object.

We can also select by row

Or by column

Or by row and column
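A sketch of these selections, again with a small made-up stand-in for the `test` object. Square brackets take the row index before the comma and the column index after it.

```r
test <- data.frame(x = 1:6,
                   y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
                   group = rep(c("a", "b"), each = 3))

test$y       # the field called y
test[1, ]    # first row, all columns
test[, 2]    # all rows, second column
test[1, 2]   # first row, second column
```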

Subset the data

It is often useful to subset your data. Here we’ll pick out just the parts of ‘test’ with group equal to ‘a’.
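Two common ways to do this are sketched below with a made-up `test` object; the names `test_a` and `test_a2` are chosen only for this example.

```r
test <- data.frame(x = 1:6,
                   y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
                   group = rep(c("a", "b"), each = 3))

# Logical indexing: keep the rows where group equals "a", and all columns
test_a <- test[test$group == "a", ]

# The subset() function gives the same result
test_a2 <- subset(test, group == "a")
test_a
```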

Exploratory data analysis

Before running any analysis, you should look at the shape of the data to determine whether it follows a common probability distribution. This will often determine what sort of tests or analyses you can run or how you should adapt the test or transform your data.

The easiest way is to plot the histogram of a variable. Here we’ll look at x and y.
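A minimal sketch with `hist()`. The `test` data here are simulated stand-ins, and the seed is fixed only so that the example is reproducible.

```r
set.seed(1)   # assumption: fixed seed for reproducibility
test <- data.frame(x = rnorm(50, mean = 10, sd = 2),
                   y = rpois(50, lambda = 3))

hist(test$x)  # histogram of x
hist(test$y)  # histogram of y
```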

Data distributions

Just to remind you of some basics about probability distributions, we are going to simulate data by randomly drawing from a distribution and then look at the histogram for the data. Note that in the examples below, we keep creating the object y – thus we are writing over the data for that object. If you wanted to keep each dataset for use later, you would need to use a unique name for each object that you create.

Normal distribution

Generate 100 data points from a normal distribution with a mean of 10 and standard deviation of 1.

Plot the histogram:
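Both steps can be sketched with `rnorm()` and `hist()`; the fixed seed is an assumption added for reproducibility.

```r
set.seed(123)                        # assumption: fixed seed
y <- rnorm(100, mean = 10, sd = 1)   # 100 draws from a normal distribution
hist(y)
```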

Lognormal distribution

Generate 100 data points from a lognormal distribution with a mean of 2 and standard deviation of 0.5 (both on the log scale).

Plot the histogram:
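A sketch with `rlnorm()`, whose `meanlog` and `sdlog` arguments are the mean and standard deviation of the logarithm of the data. The fixed seed is an assumption.

```r
set.seed(123)                               # assumption: fixed seed
y <- rlnorm(100, meanlog = 2, sdlog = 0.5)  # 100 draws from a lognormal distribution
hist(y)                                     # right-skewed, all values positive
```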

Poisson distribution

Generate 100 data points from a Poisson distribution with a lambda of 2.

Plot the histogram:
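A sketch with `rpois()`; the fixed seed is an assumption.

```r
set.seed(123)               # assumption: fixed seed
y <- rpois(100, lambda = 2) # 100 draws from a Poisson distribution
hist(y)
table(y)                    # counts of each whole-number value
```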

Note how in the previous two data sets, we had a continuous variable, and now we have a discrete variable. The Poisson distribution is often used to model count data.

Tufte plots

Edward Tufte is a proponent of simplifying and clarifying scientific communication. He has three excellent books on the subject: The Visual Display of Quantitative Information; Envisioning Information; Visual Explanations: Images and Quantities, Evidence and Narrative.

One of the themes that emerges from his work is the need to reduce unnecessary ink in figures. There is no package for R (that I know of) that does this automatically, but there are some simple methods that one can follow to get the same effect.

Simulate data

Generate data from a simple model with one linear predictor, $$\beta = 4$$, and error that follows a normal distribution with a mean of 5 and standard deviation of 8.

Look at the x and y values
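A minimal sketch of such a simulation. The predictor values (1 to 10) and the fixed seed are assumptions of this example; the slope of 4 and the error distribution follow the description above.

```r
set.seed(42)                                      # assumption: fixed seed
x <- 1:10                                         # assumption: predictor values 1 to 10
y <- 4 * x + rnorm(length(x), mean = 5, sd = 8)   # slope of 4 plus normal error

data.frame(x, y)                                  # look at the x and y values
```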

Plots

Now we could simply plot the data using the default R plotting function.
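With simulated data like those described above (the predictor values and seed are assumptions), the default plot is a single call:

```r
set.seed(42)                                      # assumption: fixed seed
x <- 1:10                                         # assumption: predictor values 1 to 10
y <- 4 * x + rnorm(length(x), mean = 5, sd = 8)

plot(x, y)                                        # default scatter plot
```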

Aspects of this type of figure that are possibly unnecessary include the border around the plotting area. The default is also to plot the axis labels parallel to the axes, which in the case of the y-axis might be suboptimal. You may also notice that the axes cover the range of the data in x and y, but do not necessarily cover the range we would want: in this case, it might be helpful to include the origin (0,0) in the plot – although this is not always the case.

Another popular option is to use a graphics package.

In this package, the plot border is removed, but a grey background with white grid has been added. The axes are still generated automatically and axis tick marks might not be chosen optimally. Axis labels tend to be displayed in a small point size. The data have been plotted with clear filled circles – but this could be problematic if points are over-plotted or overlapping.

The defaults of both the standard R plotting function and this package can be changed – and often should be to improve clarity and reduce clutter.

Suggestions moving forward

My suggestion would be to look at the data first and decide on sensible limits to the length of the x and y axes.

In this case, it might be sensible to limit the x-axis from 0 to 11, and the y-axis from 0 to 60. An empty plot can be generated, upon which one can then add elements. You’ll notice in the code that I put each element on a new line, preceded by a comma. If you want to comment out a specific element, just add a # sign at the beginning of the line that is unnecessary.
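One possible sketch of this approach with base graphics follows. The data, the seed, the tick positions, and the commented-out title are assumptions chosen for illustration.

```r
set.seed(42)                                      # assumption: fixed seed
x <- 1:10                                         # assumption: predictor values 1 to 10
y <- 4 * x + rnorm(length(x), mean = 5, sd = 8)

plot(x, y
     , type = "n"            # draw nothing yet: an empty plot
     , xlim = c(0, 11)
     , ylim = c(0, 60)
     , axes = FALSE          # no default axes and no border
     , xlab = "x"
     , ylab = "y"
     # , main = "My title"   # an element that has been commented out
)
points(x, y)                                  # add the data
axis(1, at = seq(0, 11, by = 1))              # x-axis
axis(2, at = seq(0, 60, by = 10), las = 1)    # y-axis with horizontal labels
```

Setting `las = 1` turns the y-axis tick labels horizontal, which addresses the point made earlier about labels drawn parallel to the axis.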

If one really wanted to emulate the grey background and white grid lines from the package, this can be done fairly easily. In such a case, it may be helpful to increase the contrast with the data by converting the points to filled circles.
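A sketch of one way to do this in base graphics: draw an empty plot, fill the plot region with a grey rectangle, add white lines as a grid, and then add the points. The data, seed, shade of grey, and grid spacing are assumptions.

```r
set.seed(42)                                      # assumption: fixed seed
x <- 1:10                                         # assumption: predictor values 1 to 10
y <- 4 * x + rnorm(length(x), mean = 5, sd = 8)

plot(x, y, type = "n", xlim = c(0, 11), ylim = c(0, 60),
     axes = FALSE, xlab = "x", ylab = "y")
usr <- par("usr")                                 # plot region limits: x1, x2, y1, y2
rect(usr[1], usr[3], usr[2], usr[4], col = "grey90", border = NA)
abline(v = seq(0, 10, by = 2), h = seq(0, 60, by = 20), col = "white")
points(x, y, pch = 19)                            # filled circles for contrast
axis(1, at = seq(0, 10, by = 2), tick = FALSE)
axis(2, at = seq(0, 60, by = 20), las = 1, tick = FALSE)
```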

Conclusion

With just a few tweaks, the standard plotting function can be used to create plots that reduce the amount of “chart-junk”. In my opinion, the grey background with the white grid is unnecessary, but it has become very popular as an aesthetic convention, and in digital media (such as web or projected presentations) there may be a place for this. In general, I would suggest the minimalist approach for most journals. In a case such as this, where there are no overlapping points, I recommend the code and output below.
