3  A recommended analysis workflow

Click here to download a basic RStudio project folder already set up with data and notebook subfolders. Put the folder anywhere on your computer, just not inside another RStudio project folder.

This chapter is intended as a rough outline of the sequence of steps one commonly goes through when working on a notebook. Details will differ from notebook to notebook, but the overall sequence is very common.

3.1 Are you working within your Project?

Before we even think about the notebook, we need to make sure that we are working within our Project. If we are not, bad things will happen. If you are, the Project name will appear at the top right of the RStudio window. If you are not, save the notebook you are working on, then go to File/Open Project and open your Project.

If you haven’t yet got a ‘Project’, or don’t know what that means, then make sure that everything you need for whatever you are working on is in one folder, and turn that folder into a Project. (A ‘Project’ is just a regular folder that has been given superpowers.) You do that by going to File/New Project/Existing Directory, navigating to your folder and clicking Create Project. RStudio will then restart and you will see the name of your newly anointed Project folder at the top right of the RStudio window. You can tell that a folder is a ‘Project’ because it will have a .Rproj file inside it.

If all this sounds complicated, don’t worry. It really isn’t. Just get someone to show you how to do it and you will be fine.
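If you would rather check from the R console than by eye, here is an optional one-line sanity check using only base R:

```{r}
# List any .Rproj files in the current working directory.
# If you are working inside a Project, this should return one file name;
# if it returns character(0), you are probably not where you think you are.
list.files(getwd(), pattern = "\\.Rproj$")
```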

Now, to the notebook itself:

3.2 Statement of the question(s) to be investigated

Without thinking this through, you won’t know what your notebook is for…

What is the analysis that will follow for? What question are you trying to answer? What hypotheses are you trying to test?

Suppose we were trying to test the hypothesis that there is no difference between the body masses of the Adelie, Chinstrap and Gentoo species of penguin. All we have to go on are the body masses of the penguins we happened to measure. From these measurements we want to make a statement about these three species in general.

3.3 Open a notebook

In RStudio, go to File/New File/Quarto Document. Delete everything below the yaml section at the top. This strangely named section is the bit between the two lines of three dashes. For the most part, we will not need to worry about this section; we just should not delete it entirely. What is useful is to amend the title to something sensible, and to add author: "your name" and date: "the date in any old format" lines. I also add a few final lines that suppress warnings and messages that might otherwise clutter up the printed output, so that your yaml will look something like this:

---
title: "A typical workflow"
author: "Who wrote this?"
date: "Today's date"
format:
  html:
    df-print: paged
execute:
  message: false
  warning: false
---

Delete everything beneath this yaml section. The big empty space that this leaves you with is where you write your code. Remember that in Quarto documents, the code goes in ‘chunks’ that are started and finished by lines with three backticks. Any other text goes between the chunks, and you can format this text using the simple rules of Markdown, available in the RStudio Help menu at Help/Markdown Quick Reference. Thus your notebook will end up looking something like this:

---
title: "A typical workflow"
author: "Who wrote this?"
date: "Today's date"
format:
  html:
    df-print: paged
execute:
  message: false
  warning: false
---

## First header

Any text we want to add. Note that a code chunk starts with delimiters, like this:

```{r}

```


Those are back-ticks, not apostrophes!

```{r}
library(tidyverse) # some actual R code
```

## Second header

Any text we want to add to explain what this next chunk does

```{r}
library(tidyverse) # some actual R code
```

3.4 Load Packages

R is a very powerful statistical programming language, but anyone using it typically enhances it in ways that suit them by adding in extra functions and data sets. This is done by loading so-called packages. There are thousands of these, contributed by a host of subject specialists around the world. It is totally normal to use a few packages when doing an analysis in R.

You will nearly always want the first five packages, and often you will appreciate the sixth, janitor. Others, such as vegan, will be useful from time to time, depending on what you are doing. If any of these lines throws an error, it is most likely because of a typo or because you have not yet installed that package. Install it from the console pane (not in this notebook!) using the function install.packages("name of package"), then run this whole chunk again.

```{r}
library(tidyverse)
library(here)
library(ggfortify)
library(readxl)
library(cowplot)
library(janitor)
library(vegan)
```

3.5 Load data

There are several ways to do this, so details will differ depending on what file type your data is saved in and where it is stored.

Here are some examples. In each case code here presumes that the data is stored in a subfolder called ‘data’ within the Project folder, and we use the function here() from the here package. In my experience this dramatically simplifies the business of finding your data, wherever your notebook is. It makes it easier for you to share your notebook with others and be confident that what worked for you will work for them. It does require that you are working within your project.

3.5.1 If from a csv file

If you have your data in a data subfolder within your Project, this chunk will work. Just substitute the name of your data file.

```{r}
filepath <- here("data", "ozone.csv")
ozone <- read_csv(filepath)
glimpse(ozone)
```
Rows: 20
Columns: 3
$ garden.id       <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", …
$ garden.location <chr> "West", "West", "West", "West", "West", "West", "West"…
$ ozone           <dbl> 61.7, 64.0, 72.4, 56.8, 52.4, 44.8, 70.4, 67.6, 68.8, …

3.5.2 If from an Excel file

You will need to use read_excel() from the readxl package, and you have to specify the name of the worksheet that holds the data you want. You can, if you want, specify the exact range occupied by the data. However, I suggest you avoid doing this unless it turns out that you need to. If your data is a nice, neat, rectangular block of rows and columns, you should find that you don’t need to specify the range.

```{r}
filepath <- here("data", "difference_data.xlsx")
penguins <- read_excel(path = filepath, sheet = "penguins") |>
  clean_names()
glimpse(penguins)
```
Rows: 342
Columns: 8
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2…
$ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 18…
$ body_mass_g       <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250…
$ sex               <chr> "male", "female", "female", "female", "male", "femal…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

3.5.3 If from a URL

You can load data into R directly from a URL if you are given one.

Here, we load data from a file stored in a ‘repo’ on my GitHub account:

```{r}
# penguin <- read_csv("https://raw.githubusercontent.com/mbh038/r4nqy/refs/heads/main/data/penguin.csv") |>
#   clean_names()
# glimpse(penguin)
```

3.6 Clean / Manipulate the data

Often we need to do some sort of data ‘wrangling’ to get the data into the form we want. For example, we may wish to tidy it (a word with a particular meaning when applied to data sets), to remove rows with missing values, to filter out rows from sites or time periods that we don’t want to include in our analysis, to create new columns, and so on.

For example, let’s create a new data frame for just the Adelie penguins:

```{r}
adelie <- penguins |>
  filter(species == "Adelie") # filter picks out rows according to criteria being satisfied in some column
glimpse(adelie)
```
Rows: 151
Columns: 8
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2…
$ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 18…
$ body_mass_g       <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250…
$ sex               <chr> "male", "female", "female", "female", "male", "femal…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

or maybe we just want the columns that contain numeric data, and not the ones containing the species identifiers, islands and sexes, which are text:

```{r}
penguin_numeric <- penguins |>
  dplyr::select(bill_length_mm:body_mass_g) # select() retains or leaves out particular columns. Here, we keep only the four numeric measurement columns.
glimpse(penguin_numeric)
```
Rows: 342
Columns: 4
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2…
$ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 18…
$ body_mass_g       <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250…
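The other wrangling jobs mentioned above, removing rows with missing values and creating new columns, are handled by drop_na() and mutate(). Here is a minimal sketch using a tiny made-up data frame (the column names mimic the penguin data, but the values are invented):

```{r}
library(dplyr)
library(tidyr)

# A tiny made-up data frame standing in for the real penguin data
toy <- tibble(species     = c("Adelie", "Adelie", "Gentoo"),
              body_mass_g = c(3750, NA, 5000))

toy_clean <- toy |>
  drop_na(body_mass_g) |>                   # remove rows with a missing body mass
  mutate(body_mass_kg = body_mass_g / 1000) # create a new column
toy_clean
```

The same two verbs work unchanged on the full penguins data frame.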

3.7 Summarise the data

How big is the difference between the mean of this group over here and that group over there, and how big is that difference compared to the precision with which we know those means? We nearly always want to do this as a first way to get insight into whether we will or will not reject our hypothesis. For example, let’s find the mean body masses of the three species and the standard errors of those means, and save the results to a data frame called penguin_summary:

```{r}
penguin_summary <- penguins |>
  group_by(species) |>
  summarise(n = n(),
            mean_body_mass = mean(body_mass_g),
            se_body_mass = sd(body_mass_g) / sqrt(n()))
penguin_summary
```
# A tibble: 3 × 4
  species       n mean_body_mass se_body_mass
  <chr>     <int>          <dbl>        <dbl>
1 Adelie      151          3701.         37.3
2 Chinstrap    68          3733.         46.6
3 Gentoo      123          5076.         45.5

We can look at this table and already get an idea as to whether the body masses are the same or are different for the three species.

3.8 Plot the data

The next step is usually to plot the data in some way. We would typically use the ggplot2 package from tidyverse to do this.

3.8.1 Bar plot with error bars

We could plot a bar plot with error bars, working from the summary data frame that we created:

```{r}
penguin_summary |>
  ggplot(aes(x = species, y = mean_body_mass)) +
  geom_col(fill="#a6bddb") + # this is the geom that gives us a bar plot when we have already done the calculations
  geom_errorbar(aes(ymin = mean_body_mass - se_body_mass, ymax = mean_body_mass + se_body_mass), width = 0.15) +
  labs( x = "species",
        y = "Mean body mass (g)",
        caption = "Error bars are ± one standard error of the mean") + # important to say what these error bars denote
  theme_cowplot()
```

Note that we have given the bars a fill colour. We got it from the ColorBrewer site of the cartographer Cynthia Brewer, whose palettes are available in R through the RColorBrewer package and are great for choosing colours that work well. We have used the same colour for each species, since the x-axis labels already tell us which bar relates to which species. Using a different colour for each bar would imply that some extra information is encoded by colour. Since there is not, different colours serve no purpose and potentially confuse the reader. Remember always that a plot is intended to convey a message: anything that detracts from that message should be avoided, however pretty you think it is.

A couple of points could be made about this type of plot:

First, what about those error bars? Three types of error bar are in common use, and there are arguments for and against each of them:

  • the standard deviation tells us about the spread of values in a sample, and is an estimate of the spread of values in the population;
  • the standard error of the mean, as used here, is an estimate of the precision with which each sample mean estimates the corresponding population mean;
  • the confidence interval, typically a 95% confidence interval, gives us the region within which we are (say) 95% confident that the true species mean body mass plausibly lies.

Which type of error bar is best to use depends on what story you want to tell. Here, because we are interested in whether there is evidence of a difference in the mean body mass of different species, we have gone for the standard error of the mean.

Regardless of which error bar you use and why, you should always tell the reader which one you have gone for, as we have in the caption to the figure.
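All three quantities are easy to compute directly in base R. A quick sketch, using a few made-up mass values rather than the real data:

```{r}
x <- c(3750, 3800, 3250, 3450, 3650) # made-up body masses (g)
n <- length(x)
sd_x <- sd(x)            # standard deviation: spread of the sample
se_x <- sd_x / sqrt(n)   # standard error of the mean
# 95% confidence interval for the mean, using the t distribution
ci_95 <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se_x
ci_95
```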

A second point about this bar plot is that it doesn’t tell us very much, and indeed nothing that we didn’t already know. It only conveys the mean and standard error values for each species, which is information we already have, arguably more compactly and in more easily readable form, in the table we created. Further, it potentially obscures information that might come from knowing the distribution of the data.

Here are three other plot types that do show the distribution of body masses for each species, and thus add extra information to what we already know from the summary table:

3.8.2 Box plot

```{r}
penguins |>
  ggplot(aes(x = species, y = body_mass_g)) + # what we want to plot
  geom_boxplot(fill = "#a6bddb", notch = FALSE) + # what kind of plot we want
  geom_jitter(colour = "grey50", width = 0.2, alpha = 0.5) +
  labs(x = "species",
       y = "Body mass (g)") +
  theme_cowplot() # choose a theme to give the plot a 'look' that we like
```

Here, we have added the points themselves on top of the box plot. When there are not too many data points, this can be useful: it indicates the sample size and gives more detail about the actual distribution of the points. The ‘jitter’ adds some random horizontal and/or vertical displacement so that the points do not lie on top of each other; here we have allowed a horizontal jitter of width 0.2. In this case we see that the variability of body mass is not the same for each species, and that the data are roughly symmetrically distributed around the median values in each case. This information helps us determine which statistical test might be appropriate for these data.

3.8.3 Violin plot

A useful alternative to the box plot, especially when the data set is large, is the violin plot:

```{r}
penguins |>
  ggplot(aes(x = species, y = body_mass_g)) + # what we want to plot
  geom_violin(fill = "#a6bddb") + # what kind of plot we want
  labs(x = "species",
       y = "Body mass (g)") +
  theme_cowplot() # choose a theme to give the plot a 'look' that we like
```

The widths of the blobs (I am probably supposed to call them ‘violins’!) show us the distribution of the data: where they are widest is where the data are concentrated, while the height of the blobs shows us the range of variation of the data. The positions of the blobs tell us roughly where the body masses of the different species lie and give us an idea of the differences between them.

3.8.4 Ridge plot

A bit like a violin plot. This needs the package ggridges to be installed.

```{r}
#| echo: fenced
library(ggridges)

penguins |>
  ggplot(aes(x = body_mass_g, y = species)) + # what we want to plot
  geom_density_ridges(fill = "#a6bddb") + # what kind of plot we want
  labs(x = "Body mass (g)",
       y = "Species") +
  theme_cowplot() # choose a theme to give the plot a 'look' that we like
```

Having seen the summary and one of these plots of the data, would you be inclined to reject, or fail to reject, a null hypothesis that said there was no difference between the body masses of the three species?

3.9 Statistical analysis

Only now do we move on to the statistical analysis to try to answer our initial question(s). But by now, after the summary and plot(s), we may already have a pretty good idea what that answer will turn out to be.

The exact form of the analysis could take many forms. In a typical ecology project you might carry out several types of analysis, each one complementing the other. Here, an appropriate analysis might be to use the linear model in the form of a one-way ANOVA, since we have one factor (species) with three levels (Adelie, Chinstrap and Gentoo) and an output variable that is numeric and likely to be normally distributed. We can use the lm() function for this.

3.9.1 Create the model object

```{r}
penguin.model <- lm(body_mass_g ~ species, data = penguins)
```

3.9.2 Check the validity of the model

We won’t go into this here, but an important step is to check that the data satisfy the often finicky requirements of whatever statistical test we have decided to use. The autoplot() function from the ggfortify package is great for doing this graphically.

```{r}
autoplot(penguin.model) + theme_cowplot()
```

Here we note in particular that although the spread of the data within each level is not exactly the same (top-left panel), it is not wildly different either, and the QQ plot is pretty straight (top-right panel), meaning that the data are approximately normally distributed around their respective means. Taken together, this means that these data satisfy the requirements of a linear model reasonably well, so the output of that model should be reliable.

3.9.3 The overall picture

Typically, statistical tests are testing the likelihood of the data being as they are, or more ‘extreme’ than they are, if the null hypothesis were true. Thus, the null hypothesis is central to statistical testing.

The null hypothesis is typically that the ‘nothing going on’, ‘no difference’ or ‘no association’ scenario is true. In this case, it would be that there is no difference between the body masses of the three species of penguin being considered here.

Typically too, a test will in the end spit out a p-value: the probability that we would have got the data we got, or more extreme data, if the null hypothesis were true. Being a probability, it is always a value between 0 and 1, where 0 means impossible and 1 means certain. The closer the p-value is to zero, the less likely it is that we would have got our data if the null hypothesis were true. If the p-value is small enough, we decide that the probability of getting the data we actually got, were the null hypothesis true, is so small that we reject the null hypothesis. Typically, the threshold beyond which we do this is p = 0.05, but we could choose other thresholds. (Sounds arbitrary? Yes, it is, but 0.05 is a compromise value that keeps the risks of the two types of error - rejecting the null when we should not, and failing to reject it when we should - both acceptably small. This is a big topic which we won’t explore further here.)

In the end, whatever other information we get from it, the outcome of a statistical test is typically that we either reject the null hypothesis or fail to reject it. If we reject it, then we are claiming to have detected evidence for an ‘effect’, and we go on to determine how big that effect is and whether it is scientifically interesting. If we fail to reject the null, that does not necessarily mean there is no ‘effect’ (difference, trend, association etc.). That might be the case, but it might also just mean that our data did not provide evidence for one.

It is all a bit like a law court, where the ‘null hypothesis’ is that the defendant is innocent. At the end of the proceedings this null is either rejected (Guilty!), because the evidence makes it untenable to hold onto, or not rejected, because the evidence is not strong enough to convict, in which case the defendant walks free. Crucially, the court does not declare the defendant innocent: it has simply found insufficient evidence to convict. In the same way, in a scientific study we either reject or fail to reject a null hypothesis. We never accept the null hypothesis as true.

Actually, many researchers are unhappy with this so-called ‘frequentist’ narrative and have sought to use an alternative ‘Bayesian’ approach to testing hypotheses. In this approach we can accept hypotheses and we can bring in prior knowledge. This is an interesting topic, but a very big one, so we will not pursue it further here.

With all that behind us, we are in a better place to understand what the output of the test is telling us.

For the 1-way ANOVA, as with other examples of the linear model, this output comes in two stages:

3.9.4 Overall picture

Is there evidence for a difference between at least two of the mean values?

To see if there is evidence for this, an ANOVA test calculates the ratio of the variation between the groups to the variation within the groups. It calls this ratio \(F\). The bigger \(F\) is, the more likely we are to reject the null hypothesis that there is no difference between the groups.

```{r}
anova(penguin.model)
```
Analysis of Variance Table

Response: body_mass_g
           Df   Sum Sq  Mean Sq F value Pr(>F)    
species     2 1.47e+08 73432107     344 <2e-16 ***
Residuals 339 7.24e+07   213698                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is huge. The null hypothesis of this test, as with many tests, is that there is no difference between the body masses of the three populations from which these samples have been drawn. In that case, the F value would be one. The p-value tells us how likely it is that we would get an F value as big as or bigger than the one we got for our samples if the null hypothesis were true. Since the p-value is effectively zero here, we reject the null hypothesis: we have evidence from our data that there is significant variation of body mass between species.

The degrees of freedom Df tell us the number of independent pieces of information that were used to calculate the result. Let’s not dwell on this here, but there are two that we have to report in this case: the number of levels minus one, ie 3 - 1 = 2, and the number of individual data points minus the number of levels, ie 342 - 3 = 339.
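As a check that these numbers hang together, the p-value in the table is just the upper tail of the F distribution with those degrees of freedom, which base R can compute with pf():

```{r}
df1 <- 3 - 1   # number of levels minus one
df2 <- 342 - 3 # number of data points minus the number of levels
# Probability of an F value at least this big if the null hypothesis were true
pf(344, df1, df2, lower.tail = FALSE) # effectively zero, matching Pr(>F) above
```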

3.9.5 Effect size

Now that we have established that at least two species of penguin have differing body masses, we go on to investigate where the differences lie, and how big they are. This is important: effect sizes matter. It is one thing to establish that a difference is statistically significant (and typically even the tiniest difference can show up as significant in a study if the sample size is big enough), it is quite another to establish whether the difference is big enough to be scientifically interesting.

```{r}
summary(penguin.model)
```

Call:
lm(formula = body_mass_g ~ species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-1126.0  -333.1   -33.1   316.9  1224.0 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        3700.7       37.6   98.37   <2e-16 ***
speciesChinstrap     32.4       67.5    0.48     0.63    
speciesGentoo      1375.4       56.1   24.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 462 on 339 degrees of freedom
Multiple R-squared:  0.67,  Adjusted R-squared:  0.668 
F-statistic:  344 on 2 and 339 DF,  p-value: <2e-16

The output here is typical of a 1-way ANOVA analysis in R. Each line refers to one of the three levels of the factor being investigated, which in this case is species. By default, those levels are arranged alphabetically, so in this case the order is Adelie, Chinstrap, then Gentoo. The first row is always labelled (Intercept), so here that row refers to Adelie. This level is used as the ‘control’ or reference level, the one with which the others are compared. If we are happy to have Adelie as that control then we can just carry on, but if we are not, then we have to tell R which level we want to play that role. We’ll go through how to do that later on.
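As a quick preview, one way to change the reference level is the base-R function relevel(). A minimal sketch with a toy factor rather than the real data:

```{r}
f <- factor(c("Adelie", "Chinstrap", "Gentoo"))
levels(f)                        # alphabetical by default, so Adelie is the reference

f2 <- relevel(f, ref = "Gentoo") # ask for Gentoo to come first
levels(f2)                       # Gentoo is now the reference level
```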

In the Estimate column the value 3700.66 g in the first row refers to the actual mean body mass of the Adelie penguins in the sample. If we go back to the summary table we created earlier on, or look at one of the plots we created, we see that that is the case.

For all other rows, the value in the Estimate column is not referring to the absolute mean body mass but to the difference between the mean body mass for that species and the mean body mass of the control species. So we see that the mean body mass of the Chinstrap in our sample is 32.43 g greater than that of Adelie and so is equal to 3700.66 + 32.43 = 3733.09 g, while that of the Gentoo is 1375.35 g greater and so is equal to 5076.01 g. Check from the table of mean values we created and the plots that this is correct.
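That arithmetic is worth doing explicitly once, using the Estimate values from the coefficient table above:

```{r}
adelie_mean    <- 3700.66                # Estimate for (Intercept)
chinstrap_mean <- adelie_mean + 32.43    # Adelie mean plus the Chinstrap Estimate: 3733.09 g
gentoo_mean    <- adelie_mean + 1375.35  # Adelie mean plus the Gentoo Estimate: 5076.01 g
c(adelie_mean, chinstrap_mean, gentoo_mean)
```

These should match the means in the penguin_summary table we made earlier.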

Here, though, we are not interested in absolute values so much as in differences, which is why that is what this summary table gives us. Look again at the differences between the mean body masses of Chinstrap and Gentoo and that of Adelie, and compare them with the standard errors of those differences, given in the second column of the table. For Gentoo, the difference (1375.4 g) is many times larger than its standard error (56.1 g), so we can be confident that it is statistically significant. For Chinstrap, however, the difference (32.4 g) is smaller than its standard error (67.5 g), so we have no evidence that Chinstrap and Adelie differ in body mass.

This is borne out by the p-values in the right-hand column of the table. The null hypothesis for each of these rows is that there is no difference in mean body mass between the populations of that species and of the reference species, Adelie. The p-value for Gentoo is effectively zero, while that for Chinstrap (0.63) gives us no grounds to reject the null.

Lastly, the adjusted \(R^2\) tells us the proportion of the variation in body mass that is accounted for by knowing the species. Here the value is 0.668: about two thirds of the variability in body mass is accounted for by species.
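You can recover the (unadjusted) \(R^2\) from the ANOVA table above: it is the between-species sum of squares as a fraction of the total sum of squares:

```{r}
ss_species  <- 1.47e8 # Sum Sq for species, from the ANOVA table
ss_residual <- 7.24e7 # Sum Sq for the residuals
r2 <- ss_species / (ss_species + ss_residual)
r2 # about 0.67, matching Multiple R-squared in the summary
```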

3.10 Report in plain English

You would say something like:

We find evidence that body masses are not the same across the three species of penguin, with Gentoo penguins having a significantly greater body mass than Adelie and Chinstrap penguins (ANOVA: F(2,339) = 344, p < 0.001).