3. RStudio Data Processing

In this second part, we’ll perform some basic processing on some synthetic forest inventory data.

3.1. Load external data into R

The most straightforward way to load data into R is to use comma-separated values (.csv) format, a text file format that separates records across rows and fields with commas. CSV files can be exported from spreadsheet software (e.g. Microsoft Excel).

The .csv file we’ll use is named plot_data.csv, and can be downloaded here.

Save plot_data.csv somewhere memorable (e.g. Desktop). You can open this file with a text editor (e.g. Notepad) or as a spreadsheet (e.g. Microsoft Excel) if you’d like to take a look at its contents.

This file contains synthetic data for 25 forest plots in miombo woodlands each sized 50 x 50 m (0.25 ha). Each row contains data for a unique stem, where stems have a range of properties (e.g. DBH, height, species, etc.).

Attention

The data we’ll use in this exercise is entirely made up. Definitely don’t use this dataset for any quantitative work!

To load the data into R, we’ll need two commands. First, we’ll need to set the ‘working directory’, which is the location that R will run and where we will reference the CSV file relative to.

Set the working directory to the location you saved the data to as follows:

> setwd('C:/Users/your_username/Desktop/')

Note

  • Make your you replace your_username with your actual username.
  • Note that the directory is referenced with / rather than \, this is because \ has a special meaning in R.
  • If you saved plot_data.csv at any location other than the Desktop, you’ll need to reference that location instead.

To load the data into R, you can use the read.csv() function, as follows. Be careful with inverted commas, and don’t worry for now about what stringsAsFactors = FALSE means:

> df <- read.csv('plot_data.csv', stringsAsFactors = FALSE)

Once you’ve successfully loaded in the data, you can take a peek at the first few lines of it using the head() command:

> head(df)
plot_code tag_id                  species  DBH height status x_location y_location
1   plot_01   A001  Parinari curatellifolia  9.8   12.0  alive        1.2        1.9
2   plot_01   A002       Terminalia sericea 44.9   24.0  alive        4.3        3.4
3   plot_01   A003       Afzelia quanzensis  6.3    8.5  alive        1.6        3.8
4   plot_01   A004  Erythrophleum africanum 16.9   20.0  alive        2.0        4.6
5   plot_01   A005       Terminalia sericea  9.9    9.0  alive        0.1        7.5
6   plot_01   A006 Brachystegia spiciformis 30.7   22.0  alive        3.1        8.3

Don’t go any further until you’ve been able to replicate the output above.

3.2. Data frames

What you’ve loaded in is callled a ‘data frame’. Data frames are a format that store data like a table. The top line of the table is called the header, and contains the name of each column, and following rows denote data points. In this case, columns contain a range of properties of trees (species, DBH, height, etc.), and rows contains data for an individual stem.

Data frames can contain multiple types of data, but each column of a data frame has to be a single type. For example, in this case the column plot_code is a character, and DBH is a numeric data type.

Individual columns of the data frame can be accessed using the $ operator, which returns a vector. For example:

> df$DBH
[1]  9.8 44.9  6.3 16.9  9.9 30.7  6.9  9.1  6.9 20.6 28.7 45.8  5.7 12.8  9.7 20.7  8.9 13.5  5.5  5.5  6.2 10.3 27.9
[24]  6.1 36.7 19.2  5.5  8.5  5.6 38.6 17.2 17.3  9.1 24.3 12.4 13.0 15.9 16.4  8.2  6.2  7.0 31.4  6.1 41.5  8.0 28.0
[47]  6.3 85.6  8.2 53.5 37.7  9.8 16.7 13.3 10.0  6.6 19.2 11.9  9.2 14.1  8.0 10.0 23.3  9.6  5.3 18.4 27.0 23.8 71.4
[70] 13.0 52.1 13.8 23.6  6.4 11.9 13.4 10.6 44.2  7.5 10.6 44.5 11.1 13.9 24.5 25.0  7.2  8.9 42.1 13.1 17.7 19.7 19.0
[93] 21.8 22.0  9.7 23.6  8.1 15.0  5.6 23.7 15.9 21.5  8.2  8.6 14.9  7.7 13.5 12.1  5.8  6.8 15.5 83.2  6.9 35.6  8.4
[116]  6.4  9.7 22.8 16.8 12.8  5.8 29.0  5.8 14.0  5.2 12.8 10.6 17.7  8.0 23.4 31.4 26.2 17.7 18.7  7.1 11.5 29.5  7.4
[139]  9.3 13.0 11.4 18.1 49.7  6.7  5.9 23.4  7.9  5.4  6.3  7.1 13.8 27.6 13.6  9.7 16.1  5.0 48.0 26.0  6.1  6.7 17.3
[162] 19.0 12.3 21.6  5.0 16.5 28.2 22.3 18.5 17.4 31.6 23.7  8.0 22.5 14.6 19.5 13.8 74.2 13.7 24.7 18.0 10.8 11.4 30.3
[185] 11.2 17.8 38.4  9.4 32.1 16.9 19.4 11.6 17.1 35.3  6.5 28.6 11.5 38.6 15.9  7.5  9.2 40.8 11.5  8.9 22.1 12.3 35.2
[208] 27.3 20.7 21.3  7.1  5.4  7.9 16.6 31.0  6.0 30.3  9.4 26.8 19.9 22.4 22.1 49.0 13.6 28.1 28.8  6.5  9.9 23.2 21.4
[231] 19.6 49.0 11.1 25.4 14.2 10.2  9.4  5.8 38.1 33.1 11.1 16.1 17.1 15.5 19.5  7.8 20.5 28.0 29.3  9.1 44.0 24.3 42.3
[254]  9.2 15.7  7.2 16.1 44.2 18.2 15.9 34.4  5.5  7.5 12.2 11.4  7.8 20.1 39.7  6.3 10.8  9.1  6.9  8.2  5.3 11.1 14.4
[277] 31.7  5.6 20.5 16.3  5.4 12.0  7.4 19.5 16.4 27.9 15.6 22.9 57.1 15.4  6.5 13.0 33.0  7.9  8.9 19.6 12.4 20.1 16.7
[300]  6.7  5.9 12.4 23.5  9.5 13.8  5.4 40.9 22.0 31.0 19.8  6.7 41.4 42.1  5.9 30.9 20.9 40.4  6.5  8.1  8.2 14.9 15.2
[323]  7.4 19.1 18.6 10.1 18.1 19.2 44.7 41.4  7.4  5.3 22.2 19.7 10.6 39.2 20.7  8.1 15.2 12.8  7.9  6.1 27.1 19.4 15.2
[346] 16.0 44.5 20.3  8.1 47.8 11.1 11.6 11.4 18.6 25.1 10.4  5.5  9.1  8.7  7.3 12.2 26.8 25.2 17.8  9.2 11.6 11.9 24.0
[369] 12.8 18.9  9.0 11.4  9.5  9.4  7.7  7.6  6.9  5.2 10.6  7.9  7.0 18.9 25.4 15.7 11.9 26.6 48.3 11.4  5.2 24.0 19.4
[392] 10.4 24.4 36.4 10.7  7.4  7.5  7.8 23.1 21.1 15.7 16.8 23.0 10.1 33.0  5.7  9.5 24.6  5.4  5.6 21.9 27.6 15.9  6.6
[415]  5.0 17.5 20.3  6.5 12.1 13.6 20.6  9.4  9.4  8.5 13.6  7.4 10.0 10.0 11.6 14.8 14.4 13.7  5.9 55.1  7.9 11.1 35.6
[438] 12.8 32.6 19.7 16.2 13.3  9.7 15.1 53.9 28.1 22.3  9.1 13.8 12.5 35.0  6.6 33.0 10.0  8.1 31.0 18.6 57.8  9.0  5.2
[461] 11.1 46.2 11.2 30.1  8.3 11.9  7.8 37.7  9.9 23.3 23.8 12.9  5.9  8.5  7.7 18.2  5.6  5.4  5.4 30.5 18.2 11.9  8.8
[484] 30.1 20.1 15.3  6.7  6.6 18.6 17.2  5.4 14.6 15.7 16.1 31.9 42.2  8.2 25.7 20.0  9.4 23.1 18.2 22.0  8.7 12.2 35.4
[507] 17.6  7.0 11.5  6.7 13.1  5.2  6.6 19.4 12.8 13.8 32.2 16.3 17.3  6.9 59.2 15.5 11.5 22.4 32.5 17.5 15.6 13.1  9.6
[530] 10.0  9.5 57.9 26.1  6.7 18.6  5.3 15.0  8.1 20.9 39.3 15.4  6.1 15.9  6.5  7.5 13.7 30.7 28.0 44.0  8.2 12.5 12.5
[553] 19.1 30.7 28.1 12.5  8.3 34.7  8.4 81.8 60.6 17.3 14.1 20.4  9.2  6.2 16.4 27.4 13.4 18.2 20.7 36.9 34.9  6.4 51.2
[576] 11.1  5.2  7.2 43.8 19.1 45.5 17.2 16.6 21.7 23.8  5.0 28.1  6.0 11.1 16.8 14.5  5.4 16.2 10.7 17.4 28.3 19.4 43.6
[599] 27.5 12.7 16.3  9.8 13.4 24.6 18.5 38.6  7.4 15.4  5.2 11.4 33.2 32.9 12.8 31.9  7.7  6.9 13.6 10.7  7.3  7.0  9.9
[622]  5.5 15.8 37.9  7.9  6.4 49.0  9.3 33.6  5.7  6.5 52.6  8.5  5.1  6.8 12.0 10.8 24.3  9.6  5.9  5.4 30.6 28.8 30.9
[645]  7.0 24.8  5.4  9.9 21.5 16.3  6.0  9.6 25.9 17.3  6.9 32.7 18.3  6.7 22.9  6.9 14.2  9.9 26.6 50.8 12.8 20.5 38.5
[668] 13.8  5.9 23.0 14.4 20.0 35.4  9.1 65.5 43.6 11.6 29.9  6.0 25.7 40.9  8.9 19.1 15.3 13.8 33.3 11.1  5.9 12.7 11.1
[691]  7.1  7.2 10.7  7.4  6.8 10.3 13.7 14.2 22.8 33.7  6.2 12.9  7.3 35.0  8.9  6.5 25.1 20.2  5.4  8.5  9.3  7.9  7.5
[714] 21.6 18.4  5.9 14.4 50.7 60.3 16.7  7.9 10.2 15.4  6.4 11.7 11.8 17.8  9.2 11.1  7.8 19.4 30.8 14.9  7.5 36.6 66.8
[737]  8.7 38.7  5.4 12.1  9.8  9.7 10.8 40.3 10.0 23.4  5.4 36.2 10.4  7.4  6.3 16.6  7.6 24.8 66.4 16.4 12.1 30.9  9.9
[760]  8.6  9.5 10.3 20.9 12.0 15.8  9.3  8.9 21.2  8.9 52.4  7.8  6.6  5.8  6.2 11.5  6.8 22.8 39.2 20.0  8.0 16.5  6.4
[783] 12.0 16.5 19.6 13.5 15.6 12.6 19.9 19.5 20.7 17.2  6.6 25.8  7.9 10.4  6.6 12.1 10.1 31.9 31.9 11.2  8.1 26.9  6.8
[806] 11.3 18.2  5.6 11.5  8.8 13.6  8.1  9.0 15.8 10.3 30.4 24.5  9.1 12.3  9.5 19.0 35.9 15.3 24.2  6.6 23.7 15.0  6.8
[829] 19.4 15.2 21.3 27.1 33.6 14.9  6.4 19.1  5.5 89.0  5.6 10.2 85.1  6.6 10.1 10.0 10.3 19.4 11.0 17.2 20.6  5.7 19.9
[852] 32.7  5.5 19.5 17.4 12.8 37.1 11.7 34.6 12.6 57.7 27.3  5.2  5.8 12.2 14.7 23.5 25.8 30.6 20.5 23.0 15.4 26.0  7.0
[875] 11.2 14.2 10.0 26.3 38.0  9.1  9.3  6.8 15.4  7.6 14.9  5.9 22.6 32.7  5.2 10.0 12.4 13.6 32.6 16.4  8.8 12.2 55.3
[898] 12.2 63.4 20.5  6.0 21.3 41.3 34.9 17.4  5.9 11.4 15.5  9.9  9.7 19.2 16.6 10.8 10.6 47.6 12.4 24.2 16.3 29.6 16.1
[921]  5.7  7.3  8.7 14.2 53.5  8.5 53.8  9.6 15.8  6.1  9.7 10.1  5.8 64.5 11.0 40.3 12.7  6.9 15.3 14.7 39.8 20.4 19.9
[944] 21.3  7.9 27.9  7.5 14.0 19.3 26.5 16.5  5.7 15.6 22.0  7.9 12.7  6.2 23.4  8.8 26.8 27.1 48.6 21.7 16.3  5.5 14.7
[967]  6.7 17.4 45.7 45.7 23.9 16.0  7.3 10.9 43.9 13.3  8.2 44.1  6.5  7.3 12.8  9.7 12.6 21.5 27.5  9.8  7.0 22.3  5.2
[990] 19.0 15.1 12.1 14.3  6.8 14.7  7.8  5.3 11.6 13.0  5.4
[ reached getOption("max.print") -- omitted 2580 entries ]

3.2.1. Exercises

Now it’s your turn!

  1. Calculate the mean average DBH in this dataset (HINT: recall the mean() function.
  2. Try to extract a vector of species names from the data frame.

3.3. Generating statistics

Often we’ll want to summarise our data rather than view a raw vector of numbers. A good way to do this is with the summary() function, which returns key statistics for each column of a data frame:

> summary(df)
plot_code            tag_id            species               DBH           height         status            x_location      y_location
Length:3580        Length:3580        Length:3580        Min.   : 5.0   Min.   : 2.50   Length:3580        Min.   : 0.00   Min.   : 0.00
Class :character   Class :character   Class :character   1st Qu.: 8.4   1st Qu.:10.50   Class :character   1st Qu.:12.68   1st Qu.:11.88
Mode  :character   Mode  :character   Mode  :character   Median :13.6   Median :16.00   Mode  :character   Median :25.40   Median :25.25
                                                        Mean   :17.5   Mean   :16.95                      Mean   :25.31   Mean   :25.04
                                                        3rd Qu.:22.4   3rd Qu.:22.00                      3rd Qu.:38.10   3rd Qu.:37.80
                                                        Max.   :92.9   Max.   :57.50                      Max.   :50.00   Max.   :50.00

We can also apply summary() to an individual column of a dataframe:

> summary(df$DBH)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    5.0     8.4    13.6    17.5    22.4    92.9

Other individual statistics can be calculated with built-in R functions, for example:

> mean(df$DBH)
[1] 17.50045

> median(df$DBH)
[1] 13.6

> sd(df$DBH) # Standard deviation
[1] 12.38198

> max(df$DBH)
[1] 92.9

When looking at character data, we can use other forms of summarising data. For example:

> unique(df$plot_code) # Return unique plot values
[1] "plot_01" "plot_02" "plot_03" "plot_04" "plot_05" "plot_06" "plot_07" "plot_08" "plot_09" "plot_10" "plot_11" "plot_12" "plot_13" "plot_14" "plot_15"
[16] "plot_16" "plot_17" "plot_18" "plot_19" "plot_20" "plot_21" "plot_22" "plot_23" "plot_24" "plot_25"

> table(df$plot_code) # Return number of stems in each plot
    plot_01 plot_02 plot_03 plot_04 plot_05 plot_06 plot_07 plot_08 plot_09 plot_10 plot_11 plot_12 plot_13 plot_14 plot_15 plot_16 plot_17 plot_18 plot_19 plot_20
    159     120     233     263     165     154     188      93     139      46     229     192      33     117     256      70     204     135     199     206
plot_21 plot_22 plot_23 plot_24 plot_25
    70     108      56      73      72

R also lets us modify and create new columns. Let’s say that we wanted to calculate DBH (currently in units of cm) in metres and save it to a new column:

> df$DBH_m <- df$DBH / 100

> summary(df$DBH_m)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.050   0.084   0.136   0.175   0.224   0.929

We can this new column to calculate basal area, which is usually expressed in units of m^2:

> df$basal_area <- pi * (df$DBH_m / 2) ** 2    # pi * r^2

> summary(df$basal_area)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.001963 0.005542 0.014527 0.036092 0.039408 0.677831

3.3.1. Exercises

  1. What is the mean and standard deviation of tree heights in the entire dataset?
  2. How many different species are there in this dataset?
  3. What are the three most common species in the dataset?
  4. Can you figure out what was the minimum DBH measured in these plots?

3.4. Aggregation

We often need to aggregate stem data to plot summaries. This is where using R in place of spreadsheet software starts to become very powerful.

Let’s use the basal_area column we defined earlier to calculate the total basal area of each plot. To sum the basal area in each plot we can use the aggregate() function. Be careful here, the syntax of this function differs from what we’ve covered already.

> df_plot <- aggregate(basal_area ~ plot_code, df, sum)

> df_plot$basal_area <- df_plot$basal_area / 0.25 # Translate to units of m^2/ha, as our plots are 0.25 ha in size

> df_plot
plot_code basal_area
1    plot_01  25.084091
2    plot_02  19.177783
3    plot_03  30.126482
4    plot_04  43.599101
5    plot_05  27.744608
6    plot_06  21.809257
7    plot_07  28.908511
8    plot_08  12.148756
9    plot_09  18.025749
10   plot_10   6.285862
11   plot_11  31.490451
12   plot_12  24.466042
13   plot_13   6.756331
14   plot_14  14.503772
15   plot_15  33.763903
16   plot_16   9.690883
17   plot_17  29.543069
18   plot_18  19.004022
19   plot_19  28.991242
20   plot_20  31.032360
21   plot_21   9.942044
22   plot_22  14.048863
23   plot_23  14.682639
24   plot_24   8.452713
25   plot_25   7.557422

The aggregate function produced a new data frame, where instead of each row being an individual stem each row instead relates to a plot. We can run this new data frame through some of the functions we’ve already learned. For example:

> summary(df_plot$basal_area)
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 6.286  12.149  19.178  20.673  28.991  43.599

> mean(df_plot$basal_area)
 [1] 20.67344

We can substitute the sum part of the aggregate function for another statistic. For example, to calculate stocking density, we might use:

> df_sd <- aggregate(tag_id ~ plot_code, df, length) # Calculate number of unique tags

We can combine this with our already made plot scale dataframe by adding a new column to df_plot:

> df_plot$stocking_density <- df_sd$tag_id

3.4.1. Exercise

Instead of by plot, try to aggregate the data by species. Use the data frame you produce to figure out:

  1. What is the average basal area of each species? What are the dominant species in this dataset?
  2. What species has the lowest total basal area?

3.5. Making plots

R is also very well suited to making publication quality plots. We can produce plots with the plot command. For example:

plot(height ~ DBH, df)
_images/DBH_vs_height.png

This plot shows a positive relationship between DBH and stem height, as might be expected.

3.5.1. Exercise

  1. Using `df_plot, can you build a plot of stocking density vs basal area? The resulting plot should look something like this:
_images/sd_vs_ba.png

3.6. Overview

Well done! Again, don’t worry if it didn’t all make sense, programming takes a long time to understand.

Here you’ve been introduced to data analysis in R. You should now understand:

  1. How to load data into R (i.e. read.csv).
  2. How to use and manipulate data frames (e.g. the $ operator).
  3. How to generate summary statistics with data frams (e.g. summary).
  4. How to aggreate data (i.e. aggregate).
  5. How to make simple plots using data (i.e. plot).