17  Community Ecology

As an ecologist you might visit several sites and at each one record either presence/absence or abundance data for whatever flora and/or fauna species you find there, together, possibly, with a range of environmental variables that characterise the site on that day in some way, such as properties of the soil, meteorological conditions, the site designation, aspect and so on.

You could have questions about each site, such as its species richness or its diversity, but you might also have questions about what links or differentiates the sites. Which are most and least similar to each other in terms of species assemblages, and how is this driven by the environmental variables?

In this tutorial we introduce some elementary techniques that enable us to address these questions.

17.1 Preliminaries

The vegan package used here provides tools for descriptive community ecology. It can be found on CRAN.

Some of the data used here is from Mark Gardener’s book: Statistics for Ecologists Using R and Excel.

You will need to add the data sets plant_species_lists.csv and bird.csv to your data folder inside your project folder.

This worksheet assumes that you are using RStudio and have it open on your laptop.

You should have a project folder on your laptop where you are doing the R work related to this module. You will have given it a name that is meaningful and useful to you. In what follows, I will refer to it as your RStuff folder.

You should be working within this Project. If you are not doing so already, go to File/Open Project then navigate to your RStuff (or whatever you have called your project) folder and click on the RStuff.Rproj file within it. RStudio will then restart and you will see the name of your project folder at the top-right-hand corner of the RStudio window.

Using File/New File/R Notebook, start a new R Notebook in your RStuff/scripts folder. Delete all the explanatory material beneath the yaml, amend the title in the yaml to something sensible then add author and date lines to the yaml, so that you end up with something like the yaml at the top of this script.

17.2 Load packages

library(tidyverse)
library(here)
library(vegan)
library(kableExtra)
# the cowplot package provides an excellent theme that helps plots to look good, and functions to arrange multiple plots within a figure
library(cowplot) 

Suppose we have visited several sites and listed the different species present at each site. We have not recorded how many of each species are present

17.3 Read in data

filepath<-here("data","plant_species_lists.csv")
plrich<-read_csv(filepath)
glimpse(plrich)
Rows: 183
Columns: 2
$ Site    <chr> "ML1", "ML1", "ML1", "ML1", "ML1", "ML1", "ML1", "ML1", "ML1",…
$ Species <chr> "Achillea millefolium", "Centaurea nigra", "Lathyrus pratensis…

There are two columns of data, one containing site names and one containing species names.

Here are the first few rows:

head(plrich)
# A tibble: 6 × 2
  Site  Species             
  <chr> <chr>               
1 ML1   Achillea millefolium
2 ML1   Centaurea nigra     
3 ML1   Lathyrus pratensis  
4 ML1   Leucanthemum vulgare
5 ML1   Lotus corniculatus  
6 ML1   Plantago lanceolata 

17.4 Species richness

Suppose that initially we just want to find how many species were recorded at each site. There are several ways we could do this:

17.4.1 Species richness the classic R way

Old-school R is terse and sometimes opaque, but it is also very powerful.

First we create a table indicating the presence/absence of each of the species at each of the sites.

ps<-table(plrich$Species,plrich$Site)
ps # - uncomment this line if you want to see the table we have just created -it is big!
                          
                           ML1 ML2 MU1 MU2 PL2 PU2 SL1 SL2 SU1 SU2
  Achillea millefolium       1   1   1   1   0   1   0   0   0   0
  Aegopodium podagraris      0   0   0   0   0   0   0   1   0   0
  Agrostis capillaris        0   1   1   1   1   0   1   1   1   0
  Agrostis stolonifera       0   0   0   0   0   1   0   1   0   0
  Anthriscus sylvestris      0   0   0   0   0   0   0   0   1   1
  Arctium minus              0   0   0   0   0   0   0   0   0   1
  Arrhenatherum elatius      0   0   0   0   1   0   0   1   1   1
  Bidens cernua              0   0   0   0   1   0   0   0   0   0
  Brachythecium rutabulum    0   0   0   0   1   0   1   1   1   0
  Bromus hordeaceus          0   0   0   0   0   0   1   0   1   0
  Calystegia sepium          0   0   0   0   0   0   0   1   0   0
  Capsella bursa-pastoris    0   0   0   0   1   1   0   0   0   0
  Cardamine pratensis        0   0   0   0   1   0   0   0   0   0
  Centaurea nigra            1   1   1   1   0   0   0   0   0   0
  Cerastium fontanum         0   0   1   0   0   1   0   0   1   1
  Chamerion angustifolium    0   0   0   0   0   0   0   0   1   0
  Chenopodium album          0   0   0   0   0   0   0   0   1   0
  Cirsium arvense            1   0   1   1   1   1   1   1   1   1
  Cirsium palustre           0   0   0   0   1   0   0   0   0   0
  Crataegus monogyna (s)     0   0   0   0   0   0   0   0   1   0
  Cynosurus cristatus        1   0   0   1   0   0   0   0   0   0
  Dactylis glomerata         0   0   0   0   0   0   0   1   1   1
  Deschampsia flexuosa       1   0   1   0   0   0   0   0   0   0
  Elytrigia repens           0   0   0   0   0   0   0   1   0   0
  Epilobium hirsutum         0   0   0   0   0   0   0   1   0   1
  Epilobium montanum         0   0   0   0   1   0   0   0   0   1
  Fallopia convolvulus       0   0   0   0   0   0   0   0   1   1
  Festuca rubra              1   0   1   0   0   0   0   0   0   0
  Filipendula ulmaria        0   0   0   0   0   0   0   1   0   0
  Fraxinus excelsior (s)     0   0   0   0   0   0   0   0   1   0
  Galium aparine             0   0   0   0   0   0   1   1   1   1
  Galium verum               0   1   1   0   0   0   0   0   0   0
  Geranium columbinum        0   0   0   0   0   0   1   0   0   0
  Geranium molle             0   0   0   0   0   0   1   0   0   0
  Glechoma hederacea         0   0   0   0   0   0   0   0   0   1
  Hedera helix (g)           0   0   0   0   0   0   0   0   0   1
  Heracleum sphondylium      0   0   0   0   0   0   0   0   1   1
  Holcus lanatus             1   1   1   0   0   1   1   1   1   1
  Impatiens glandulifera     0   0   0   0   0   0   0   1   0   0
  Juncus effusus             0   0   0   0   1   0   0   0   0   0
  Juncus inflexus            0   0   0   0   0   0   0   0   0   1
  Lathyrus pratensis         1   0   0   0   0   0   0   1   0   0
  Leucanthemum vulgare       1   1   1   1   0   0   0   0   0   0
  Lolium perenne             0   0   0   0   0   1   1   0   0   1
  Lotus corniculatus         1   1   1   1   0   1   0   0   0   0
  Phalaris arundinacea       0   0   0   0   0   0   0   1   0   0
  Plantago lanceolata        1   1   1   1   0   0   0   0   0   0
  Plantago major             0   1   1   1   0   0   0   0   0   0
  Poa pratensis              0   0   0   0   0   0   0   0   1   0
  Poa trivialis              0   0   0   0   0   0   0   1   0   0
  Prunella vulgaris          1   1   0   1   0   0   0   0   0   0
  Quercus robur (s)          0   1   0   0   0   0   0   0   1   0
  Quercus seedling/sp        0   0   1   0   0   0   0   0   0   0
  Ranunculus acris           1   0   1   0   0   0   0   0   0   0
  Ranunculus repens          0   1   1   1   1   1   1   1   1   0
  Rubus fruticosus agg (g)   0   0   0   0   0   0   1   1   1   1
  Rumex acetosa              0   1   1   1   0   0   0   0   0   0
  Rumex crispus              0   0   0   0   0   0   1   0   1   0
  Rumex obtusifolius         0   0   1   0   0   0   0   1   1   1
  Rumex sanguineus           0   0   0   0   0   0   1   0   0   0
  Salix caprea (s)           0   0   0   0   0   0   0   0   0   1
  Salix fragilis (s)         0   0   0   0   0   0   0   1   0   0
  Silene dioica              0   0   0   0   0   0   0   0   0   1
  Sonchus arvensis           0   0   0   0   0   0   1   0   1   1
  Sonchus asper              0   0   0   0   0   0   0   0   0   1
  Stachys sylvatica          0   0   0   0   0   0   0   1   1   1
  Symphytum officinale       0   0   0   0   0   0   0   1   0   0
  Tanacetum vulgare          0   0   0   0   0   0   0   0   0   1
  Taraxacum seedling/sp      0   0   1   1   0   0   0   0   0   0
  Thuidium tamariscinum      0   0   0   0   0   0   1   0   0   0
  Trifolium dubium           0   0   0   0   0   1   0   0   0   0
  Trifolium pratense         1   1   1   1   0   0   0   0   0   0
  Trifolium repens           1   1   1   0   1   1   0   0   0   0
  Urtica dioica              0   1   0   0   1   0   1   1   1   1
  Veronica arvensis          0   0   0   0   0   0   0   0   1   0
  Vicia hirsuta              0   0   0   0   0   0   0   0   1   1

Note that in this table, most of the entries are zeros. This is common in data recording of this kind. At any one site, most of the species are not found.

Then we calculate the sum of each column. This will give us the number of species found at each site.

psr<-colSums(ps)
psr
ML1 ML2 MU1 MU2 PL2 PU2 SL1 SL2 SU1 SU2 
 15  16  21  14  13  11  16  24  27  26 

Note that the actual number of species at each site will be greater than these values. How much greater, do you think, and how could we estimate the true richness from the observed richness?

17.4.2 Species richness the tidyverse way

We can use the group_by() and summarise() combo yet again. In this code chunk we finish off with the arrange() function from the dplyr package (the sub-package of tidyverse that is used for data manipulation) to present the sites in descending order of species richness.

plr<-plrich |>
  group_by(Site) |>
  summarise(species.richness=n()) |>
  arrange(Site)
plr
# A tibble: 10 × 2
   Site  species.richness
   <chr>            <int>
 1 ML1                 15
 2 ML2                 16
 3 MU1                 21
 4 MU2                 14
 5 PL2                 13
 6 PU2                 11
 7 SL1                 16
 8 SL2                 24
 9 SU1                 27
10 SU2                 26

Try tweaking the code so that you get this table in ascending order of richness, or in alphabetical order of site. Hint: try adjusting the argument of the arrange() function.

Plot of species richness

plrich |>
  group_by(Site) |>
  summarise(species.richness=n()) |>

  #ggplot(aes(x=Site,y=species.richness)) +
  ggplot(aes(x=fct_reorder(Site,species.richness),y=species.richness)) +
  geom_point(size=3) +
  labs(x="Site",
       y="Species richness") +
  theme_cowplot()

17.4.3 Species richness the vegan way

vegan is a package that brings with it a lot of functions intended for analysis of community ecology data. It’s style is ‘old-school’ R rather than tidyverse R, which is what we have mainly been using.

ps<-table(plrich$Species,plrich$Site)
specnumber(ps,MARGIN=2) #MARGIN=2 means summing over columns (not the usual vegan way!)
ML1 ML2 MU1 MU2 PL2 PU2 SL1 SL2 SU1 SU2 
 15  16  21  14  13  11  16  24  27  26 

17.4.4 Species richness of the dune data set.

This data set is built into the vegan package. It contains cover class values of 30 species at 20 sites, with the sites arranged by row and the species by column. This is the default expectation of functions in vegan. The species names are abbreviated to 4+4 letters.

data(dune)
head (dune)
  Achimill Agrostol Airaprae Alopgeni Anthodor Bellpere Bromhord Chenalbu
1        1        0        0        0        0        0        0        0
2        3        0        0        2        0        3        4        0
3        0        4        0        7        0        2        0        0
4        0        8        0        2        0        2        3        0
5        2        0        0        0        4        2        2        0
6        2        0        0        0        3        0        0        0
  Cirsarve Comapalu Eleopalu Elymrepe Empenigr Hyporadi Juncarti Juncbufo
1        0        0        0        4        0        0        0        0
2        0        0        0        4        0        0        0        0
3        0        0        0        4        0        0        0        0
4        2        0        0        4        0        0        0        0
5        0        0        0        4        0        0        0        0
6        0        0        0        0        0        0        0        0
  Lolipere Planlanc Poaprat Poatriv Ranuflam Rumeacet Sagiproc Salirepe
1        7        0       4       2        0        0        0        0
2        5        0       4       7        0        0        0        0
3        6        0       5       6        0        0        0        0
4        5        0       4       5        0        0        5        0
5        2        5       2       6        0        5        0        0
6        6        5       3       4        0        6        0        0
  Scorautu Trifprat Trifrepe Vicilath Bracruta Callcusp
1        0        0        0        0        0        0
2        5        0        5        0        0        0
3        2        0        2        0        2        0
4        2        0        1        0        2        0
5        3        2        2        0        2        0
6        3        5        5        0        6        0

To find the species richness of the sites we use the specnumber() function to find the number of non-zero values in each row:

specnumber(dune,MARGIN=1)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 5 10 10 13 14 11 13 12 13 12  9  9 10  7  8  8  7  9  9  8 

Note that we use MARGIN=1 in this case, which instructs vegan to sum along rows since the sites are arranged by row, rather than by column. In fact, we could have left it out, since as we can see from the help for specnumber, the default value for MARGIN is 1, ie sites in rows, species in columns, which is the default expectation of vegan. If we wanted the opposite, sites in columns and species in rows, then we would have specified MARGIN=2.

17.4.5 Species richness of the dune data set - doing it the tidyverse way

You can take this section or leave it - I’ll take it!. If you find the vegan talk of MARGIN confusing, then you might prefer to use the tidyverse style to find the species richness of the sites in the dune data set.

In the code chunk below we note that the dune data set is not at all tidy. To be so, all 30 columns in the original data set should be collapsed into two, which here we will call “species” and “abundance”. We use the (I find) repeatedly useful function pivot_longer() to do this. Before we do this it will be useful to use the mutate() function to add a column that gives an ID number for each site.

dune |>
  mutate(id=seq(1:n())) |># add in a site ID column
  pivot_longer(1:30,names_to="species",values_to="abundance") |> #tidy the abundance data
  filter(abundance>0) |> # remove the lines where the the abundance is zero
  group_by(id) |> # group by site
  summarise(species.richness=n()) # count the number of species (rows) left for each site.
# A tibble: 20 × 2
      id species.richness
   <int>            <int>
 1     1                5
 2     2               10
 3     3               10
 4     4               13
 5     5               14
 6     6               11
 7     7               13
 8     8               12
 9     9               13
10    10               12
11    11                9
12    12                9
13    13               10
14    14                7
15    15                8
16    16                8
17    17                7
18    18                9
19    19                9
20    20                8

17.5 Diversity Index

Better stories can be told about Simpson’s index than about Shannon’s index, and still grander narratives about rarefaction (Hurlbert 1971). However, these indices are all very closely related (Hill 1973), and there is no reason to despise one more than others (but if you are a graduate student, don’t drag me in, but obey your Professor’s orders).

If we have abundance data. or each species at a site, for example percentage cover, we have more information than if had simply had presence-absence data. Hence, not surprisingly, w can Given the proportion \(p_i\) of each of several species \(i\) within a population, we can estimate the so-called diversity of the population in one of several ways from proportions \(\hat p_i\) of each species within a sample. The proportions can be of total counts or of total percentage cover.

17.5.1 Shannon’s Diversity Index

This calculates \(H\) where

\[H=-\sum{p_i\log_b p_i}\]

where \(b\) is the base of the log. Most commonly, natural logs to base \(e^1\) are used. Those are the ones often denoted ln on calculators and in some books. Or sometimes log_e, or sometimes just log. Confusing, eh? It is usually clear from the context when natural logs are being used.

17.5.2 Simpson’s and Inverse Simpson’s Diversity Index

These are based on the quantity

\[D=\sum_ip_i^2\]

Simpson’s index is \(1-D\) and varies from 0 to 1 while the inverse Simpson’s index is \(1/D\)

Note how this measure \(D\) makes sense as a diversity index. If a site had one species that comprised 100% of the cover of the site, ie \(p = 1\) for that species and \(p = 0\) for every other species, then the diversity of that site would be low, right? The value of \(D\) would be 1 and the Simpson’s index value would thus be zero.

On the other hand, if we had a more diverse site with each of ten species having a percentage cover of 10%, so that each of the \(p_i\) would be 0.1, then the value of \(D\) would be \(\sum_ip_i^2=10\times 0.1^2=0.1\) This would give a Simpson’s diversity index value of 0.9.

Try this out for a site with two species, or five, or 100, each with equal cover \(p_n\). What values do you get in these cases for the Simpson’s index?

\(N\) species \(p\) \(D\) Simpson’s Index = \(1-D\)
2 0.5 \(p_1^2 + p_2^2 = 0.5\) \(1-0.5 = 0.5\)
5 0.2 \(p_1^2 + \cdots p_5^2 = 0.2\) \(1-0.2 = 0.8\).
100 0.01 \(p_1^2 + \cdots p_{100}^2 = 0.01\) \(1-0.01 = 0.99\).

17.6 Calculation of bird species diversities in a range of habitats

As an example, let us suppose we have gathered count data on the numbers of unique individuals of different species of birds observed in a range of habitats. What are the diversities of bird species in each of the habitats?

Our data collection sheet might look like this:

bird<-read_csv(here("data","bird.csv"))
bird
# A tibble: 6 × 6
  Species       Garden Hedgerow Parkland Pasture Woodland
  <chr>          <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
1 Blackbird         47       10       40       2        2
2 Chaffinch         19        3        5       0        2
3 Great Tit         50        0       10       7        0
4 House Sparrow     46       16        8       4        0
5 Robin              9        3        0       0        2
6 Song Thrush        4        0        6       0        0

This is nice for us to read, but not the best for analysis. It is not ‘tidy’. Each of the habitats surveyed is really a different level of the single variable ‘habitat’. Hence, this variable is appearing across five columns. A tidy data set has any variable appear only once in any one row. When this is the case, analysis becomes much easier. We can tidy this data set using the command pivot_longer(), which we use here to produce a new version of the data set called tidy.bird, in which we have just two columns, habitat and abundance. This data set is tidy.

Note that tidy data sets tend to be longer and thinner than untidy data sets, with more rows and fewer columns.

In the chunk below, we start with the data frame bird, tidy it, then write the result into an object called tidy.bird.

tidy.bird<- bird |>
  pivot_longer(Garden:Woodland,names_to="habitat",values_to="abundance") |>
  arrange(habitat) # sort in order of habitat name

tidy.bird # have a look at it
# A tibble: 30 × 3
   Species       habitat  abundance
   <chr>         <chr>        <dbl>
 1 Blackbird     Garden          47
 2 Chaffinch     Garden          19
 3 Great Tit     Garden          50
 4 House Sparrow Garden          46
 5 Robin         Garden           9
 6 Song Thrush   Garden           4
 7 Blackbird     Hedgerow        10
 8 Chaffinch     Hedgerow         3
 9 Great Tit     Hedgerow         0
10 House Sparrow Hedgerow        16
# ℹ 20 more rows

In this next chunk we use the tidy data set to calculate the three diversity indices - Shannon, Simpson and Inverse Simpson, for each of the habitats. First we do this in a ‘home-made’ way, typing in the formulae for each index ourselves

tidy.bird |>
  group_by(habitat) |>
  filter(abundance>0) |>
  summarise(N=sum(abundance),
            shannon.di=-sum((abundance/sum(abundance))*log(abundance/sum(abundance))),
            simpson.di=1-sum((abundance/sum(abundance))^2),
            inv.simpson.di=1/sum((abundance/sum(abundance))^2)) |>
  arrange(-shannon.di) |>
  kbl(digits=3) |>
  kable_styling(full_width=FALSE)
habitat N shannon.di simpson.di inv.simpson.di
Garden 175 1.542 0.762 4.205
Parkland 69 1.248 0.617 2.609
Hedgerow 32 1.154 0.635 2.738
Woodland 6 1.099 0.667 3.000
Pasture 13 0.984 0.592 2.449

Now let’s do that using the diversity() function in vegan. I have left in the MARGIN and base arguments, but we could have left them out since we are happy to use their default values.

# MARGIN =2 means sum along rows

tidy.bird |>
  group_by(habitat) |>
  summarise(N=sum(abundance),
            shannon.di=diversity(abundance, index = "shannon", MARGIN = 2,base=exp(1)),
            simpson.di=diversity(abundance, index = "simpson", MARGIN = 2),
            inv.simpson.di=diversity(abundance, index = "invsimpson", MARGIN = 2)) |>
  arrange(-shannon.di) |> # sort rows in descending order of shannon diversity index
  kbl(digits=3) |>
  kable_styling(full_width = FALSE)
habitat N shannon.di simpson.di inv.simpson.di
Garden 175 1.542 0.762 4.205
Parkland 69 1.248 0.617 2.609
Hedgerow 32 1.154 0.635 2.738
Woodland 6 1.099 0.667 3.000
Pasture 13 0.984 0.592 2.449

In the next chunk we use the diversity() function alone to give us the Shannon diversities for each of the habitats. This line of code uses the original untidy data set bird, and asks that the counts in columns 2 to 6 be used to caclulate the diversities of each of the habitats. MARGIN=2 is the instruction to look down the columns (MARGIN=1 would mean look along the rows) , while [,2:6] is an example of how data fames can be spliced and diced. bird[2,3] would mean take the element at row 2, column 3, while [,2:6] means take the block of data in all the rows, and columns 2 to 6

diversity(bird[,2:6],MARGIN=2)
   Garden  Hedgerow  Parkland   Pasture  Woodland 
1.5422709 1.1538939 1.2483919 0.9839614 1.0986123 

In this example we see that the vegan package provides functions that do useful things for us that we sometimes could have done in another way, but where that way might have been more complicated. In this example, use of the diversity() function from vegan means that we don’t have to know or type in the actual formulae for whichever index we are after.

Now let us find the Shannon diversity index of the 20 sites in the dune dataset. That is, species in columns, sites in rows.

data(dune)
diversity(dune)
       1        2        3        4        5        6        7        8 
1.440482 2.252516 2.193749 2.426779 2.544421 2.345946 2.471733 2.434898 
       9       10       11       12       13       14       15       16 
2.493568 2.398613 2.106065 2.114495 2.099638 1.863680 1.979309 1.959795 
      17       18       19       20 
1.876274 2.079387 2.134024 2.048270 

See how easy this is to implement when we have arranged our data as in the dune dataset?

17.6.1 Are these diversities different?

If we have a number of habitats, and for each of them we have arrived at one or other of the diversity indices considered above, we might reasonably ask if there is a significant difference between the indices.

This is difficult. For each habitat we have a single number, so we cannot use a t-test/ANOVA, and the number is not a count, so we cannot use a chi-squared test.

17.7 (Dis)similarity

The last problem, of telling how similar or dissimilar sites are, based on the species assemblages found at them or on a clutch of environmental variables measured at each site, or both, is a common task in ecology.

In this and the following sections we will explore the concept of ‘dissimilarity’ and the various ways it can be calculated and displayed.

The vegan package has the function vegdist() which permits dissimilarity indices to be calculated for both presence-absence data and for data where abundances have been recorded in some form (eg count, frequency, percentage cover).

For more information on vegdist, particularly on the specifics of what the possible indices are that can be calculated, type ?vegdist into the console window and look at the Help information that appears in the bottom-right pane of RStudio.

Let us calculate the Jaccard index of dissimilarity between the sites of the dune data set. This data set, remember, gives cover class values of 30 species on 20 sites. Note that there is an accompanying data set dune.env which contains environmental data for each of the sites. This is a common scenario, whereby you gather both species and environmental data for a given site.

The Jaccard index of dissimilarity between two sites is the ratio of the number of species that they have in common to the total number of species contained in either one or both sites.

This can be calculated as follows:

  • Number of species in site A = A
  • Number of species in site B = B
  • Number of species common to both sites = C
  • Thus, total number of species across both sites = A + B - C

From which Jaccard Index J = C / (A + B - C)

If each site has exactly the same species assemblage, so that they are identical in this respect, then this index is 1. If there are no shared species between the sites so that there is no point of similarity, then this index is 0. Hence the Jaccard index J between two sites is always between 0 and 1.

For example if clifftop site A contains Armeria maritima, Jasione montana and Silene uniflora while clifftop site B contains Armeria maritima, Jasione montana and Anthyllis vulneraria, the number of species they have in common is two, while the total number of species between them is four so that the Jaccard index of dissimilarity between them \(J(A,B) = J(B,A) = \frac{2}{4}= 0.5\)

THe Sorensen index is similar

Since we have 20 sites, so that each site has to be compared with 19 others, and since the dissimilarity of site 1 and site 2 is the same as that of site 2 and site 1 ie J(1,2) = J(2,1), just as the distance from Bristol to Bath is the same as the distance from Bath to Bristol, this gives 190 distinct Jaccard indices for this data set.

dune.dist<-round(vegdist(dune,method="jaccard",binary=FALSE),2)
dune.dist
      1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
2  0.64                                                                      
3  0.62 0.51                                                                 
4  0.69 0.53 0.43                                                            
5  0.78 0.58 0.64 0.67                                                       
6  0.78 0.68 0.72 0.78 0.46                                                  
7  0.71 0.61 0.64 0.67 0.37 0.37                                             
8  0.79 0.70 0.49 0.58 0.78 0.74 0.69                                        
9  0.75 0.65 0.51 0.55 0.67 0.75 0.66 0.48                                   
10 0.73 0.45 0.64 0.65 0.52 0.48 0.43 0.70 0.75                              
11 0.72 0.70 0.71 0.74 0.77 0.62 0.62 0.69 0.75 0.58                         
12 0.96 0.83 0.61 0.69 0.82 0.78 0.77 0.61 0.52 0.84 0.80                    
13 0.91 0.75 0.60 0.68 0.81 0.86 0.78 0.54 0.58 0.85 0.86 0.52               
14 1.00 0.88 0.86 0.89 0.94 0.89 0.93 0.72 0.86 0.86 0.90 0.82 0.79          
15 1.00 0.95 0.83 0.85 0.92 0.89 0.91 0.60 0.80 0.92 0.85 0.77 0.81 0.53     
16 0.96 0.94 0.80 0.80 0.94 0.92 0.94 0.60 0.79 0.94 0.93 0.74 0.75 0.70 0.53
17 0.94 0.90 0.94 0.95 0.77 0.81 0.80 0.94 0.94 0.77 0.82 0.96 0.93 0.95 0.94
18 0.88 0.75 0.76 0.80 0.70 0.66 0.71 0.78 0.81 0.65 0.49 0.85 0.89 0.91 0.84
19 1.00 0.89 0.91 0.88 0.83 0.84 0.85 0.85 0.88 0.83 0.71 0.82 0.90 0.92 0.88
20 1.00 0.97 0.87 0.87 0.94 0.92 0.94 0.66 0.82 0.94 0.89 0.82 0.84 0.62 0.46
     16   17   18   19
2                     
3                     
4                     
5                     
6                     
7                     
8                     
9                     
10                    
11                    
12                    
13                    
14                    
15                    
16                    
17 1.00               
18 0.93 0.86          
19 0.95 0.72 0.71     
20 0.51 0.95 0.82 0.85

17.8 Visualising similarity

17.8.1 Heatmaps

One way to visualise this dissimilarity matric is to use a heat map, in which the darker colours indicate greater ‘dissimilarity’.

heatmap(as.matrix(dune.dist))

17.8.2 Dendrograms

As we have seen, the function vegdist() produces a matrix of distance values between every pair of sites. We used a heat map to give a visual indication of the overall pattern of dissimilarity between the sites. A dendrogram is another useful way to visualize this to help us identify where the greatest similarities and differences lie.

dune.hc1<-hclust(dune.dist)
dune.hc2<-hclust(dune.dist,method="ward.D2")
opar <- par(mfrow = c(1, 1))
plot(dune.hc2,hang=-1)

#plot(dune.hc2, hang=-1)
par(opar)

The dendrograms are a useful, but for most real data, approximate summary of the distance matrix. The key way to interpret them is that the height at which two sites are joined is an indication of their dis-similarity. The higher that is, the less similar they are. These dendrograms sometimes suggest which clusters may exist in your data, but they can also sometimes be misleading about that. They are most accurate near the bottom, so in this example the suggestion from the dendrogram that sites 6 and 7 are most similar to each other, similarly 3 and 4, is probably correct.

Note also that dendrograms become very hard to interpret when the number of data points (in this case sites) becomes very large. This is less often an issue in the field of ecology where we are typically dealing within a samll number of sites, as here, but in other settings such as molecular biology of the gene where the data may consist of thousands of DNA sequences, they become impossible to decipher, and also very expensive computationally.

Exercise

For the dune data, find the species richness and Simpson diversity index for each site, then create a dendrogram that illustrates which sites are most similar/dissimilar to each other

We suppose that we have already the data into a data frame called dune from a .csv file, which is arranged the vegan way, with sites as rows, species as columns.

# species richness
specnumber(dune)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 5 10 10 13 14 11 13 12 13 12  9  9 10  7  8  8  7  9  9  8 
# Simpson's diversity index
diversity(dune, index = "simpson")
        1         2         3         4         5         6         7         8 
0.7345679 0.8900227 0.8787500 0.9007407 0.9140076 0.9001736 0.9075000 0.9087500 
        9        10        11        12        13        14        15        16 
0.9115646 0.9031909 0.8671875 0.8685714 0.8521579 0.8333333 0.8506616 0.8429752 
       17        18        19        20 
0.8355556 0.8614540 0.8740895 0.8678460 
# Dissimilarity dendrogram
dune.dist<-vegdist(dune)
dune.hc<-hclust(dune.dist)
plot(dune.hc)