library(daymetr)
library(tidyverse)
Data Basics
Learn the basics of manipulating data in R
In this workshop you will:
- Download climate data with daymetr
- Manipulate data with tidyverse:
  - Format date elements
  - Use select() to choose variables from a data frame.
  - Use filter() to choose data based on values.
  - Use mutate() to create new variables.
  - Use reframe() to summarize datasets.
  - Use full_join() to merge datasets.
- Visualize data with ggplot2
- Learn to write basic functions
Data manipulation with tidyverse
Load the required libraries:
library(daymetr)
library(tidyverse)
What is the climate of Yale-Myers Forest?
Yale-Myers Forest is a 7,840-acre research forest in Connecticut. It is a hub for education, research, and harvesting operations within the Yale Forest School system. The forest consists primarily of mixed hardwoods on glacial till soils, with a significant presence of hemlock, scattered white pine stands of old-field origin, and red pine plantations established in the 1940s.
To explore the climate of the area, download daily Daymet data for (41.952909, -72.123859). Review the arguments for the download_daymet() function:
?download_daymet
To use the download_daymet() function you need the latitude, longitude, start year (1990), and end year (2018).
Yale.Myers.daymet <- download_daymet(lat = 41.952909,
                                     lon = -72.123859,
                                     start = 1990,
                                     end = 2018)
Downloading DAYMET data for: Daymet at 41.952909/-72.123859 latitude/longitude !
Done !
You now have a nested list that includes information on the site location. The climate data itself is stored in the “data” element of the nested list. Pull the data out of the list and create a dataframe.
Yale.Myers <- as.data.frame(Yale.Myers.daymet$data)
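To see how pulling one element out of a nested list works without downloading anything, here is a minimal sketch. The `mock_daymet` object is a hypothetical stand-in: its field names other than `data` are assumptions for illustration, not a guaranteed description of what download_daymet() returns.

```r
# Hypothetical stand-in for the object returned by download_daymet();
# field names other than "data" are illustrative assumptions.
mock_daymet <- list(
  site = "Daymet",
  latitude = 41.952909,
  longitude = -72.123859,
  data = data.frame(year = c(1990, 1990),
                    yday = c(1, 2),
                    tmax..deg.c. = c(5.15, 3.11))
)

# List the top-level elements of the nested list
names(mock_daymet)

# Pull the "data" element out and confirm that it is a data frame
mock_df <- as.data.frame(mock_daymet$data)
class(mock_df)
```

On the real object, str(Yale.Myers.daymet, max.level = 1) gives a similar overview of the top-level elements.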
Look at the dataframe:
head(Yale.Myers )
year yday dayl..s. prcp..mm.day. srad..W.m.2. swe..kg.m.2. tmax..deg.c.
1 1990 1 32422.37 0.00 179.64 15.20 5.15
2 1990 2 32467.03 0.00 200.25 14.94 3.11
3 1990 3 32515.34 0.00 242.26 14.37 7.38
4 1990 4 32567.27 2.22 184.50 13.49 8.79
5 1990 5 32622.77 0.00 194.86 12.06 5.58
6 1990 6 32681.80 0.00 218.59 11.97 3.72
tmin..deg.c. vp..Pa.
1 -2.31 515.41
2 -5.39 408.97
3 -4.64 432.86
4 -3.91 457.55
5 -2.59 504.75
6 -5.77 397.13
Evaluate the data types:
summary( Yale.Myers)
year yday dayl..s. prcp..mm.day.
Min. :1990 Min. : 1 Min. :32180 Min. : 0.000
1st Qu.:1997 1st Qu.: 92 1st Qu.:35763 1st Qu.: 0.000
Median :2004 Median :183 Median :43200 Median : 0.000
Mean :2004 Mean :183 Mean :43200 Mean : 3.726
3rd Qu.:2011 3rd Qu.:274 3rd Qu.:50638 3rd Qu.: 3.560
Max. :2018 Max. :365 Max. :54220 Max. :110.330
srad..W.m.2. swe..kg.m.2. tmax..deg.c. tmin..deg.c.
Min. : 38.95 Min. : 0.00 Min. :-13.53 Min. :-25.980
1st Qu.:210.89 1st Qu.: 0.00 1st Qu.: 6.07 1st Qu.: -3.330
Median :306.97 Median : 0.00 Median : 15.14 Median : 3.490
Mean :313.20 Mean : 26.99 Mean : 14.48 Mean : 3.555
3rd Qu.:414.61 3rd Qu.: 35.57 3rd Qu.: 23.54 3rd Qu.: 11.820
Max. :678.41 Max. :245.38 Max. : 35.51 Max. : 22.950
vp..Pa.
Min. : 73.32
1st Qu.: 477.75
Median : 783.27
Mean : 945.93
3rd Qu.:1382.58
Max. :2800.24
It’s a Date
Format the date variables and create a date in the format YYYY-mm-dd :
Step 1. Use the year and day of the year to create a year.day variable.
Yale.Myers$year.day <- paste(Yale.Myers$year, Yale.Myers$yday, sep = "-") %>% as.Date(format = "%Y-%j")
Look at your work:
head(Yale.Myers$year.day)
[1] "1990-01-01" "1990-01-02" "1990-01-03" "1990-01-04" "1990-01-05"
[6] "1990-01-06"
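The format string in Step 1 can be tried on single values. This short sketch uses only base R and shows how "%j" (day of the year) combines with "%Y" (four-digit year); the example strings are chosen for illustration.

```r
# "%Y" is the four-digit year and "%j" is the day of the year
first_day <- as.Date("1990-1", format = "%Y-%j")
day_32 <- as.Date("1990-32", format = "%Y-%j")

# Day 32 is the day after 31 January
format(first_day, "%Y-%m-%d")
format(day_32, "%Y-%m-%d")
```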
Step 2. Format the year.day as YYYY-mm-dd and convert to a date.
Yale.Myers$date <- format(Yale.Myers$year.day, "%Y-%m-%d") %>% as.Date()
Look at your work:
head(Yale.Myers$date)
[1] "1990-01-01" "1990-01-02" "1990-01-03" "1990-01-04" "1990-01-05"
[6] "1990-01-06"
Check the class:
class(Yale.Myers$date)
[1] "Date"
mutate()
The function mutate() creates new columns that are functions of existing variables. Use mutate() to create a month (month) and Julian day (julianD) from the formatted date.
Yale.Myers <- Yale.Myers %>% mutate(month = format(date, format = "%m"),
                                    julianD = format(date, format = "%j"))
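The same pattern can be tested on a small data frame. This is a sketch with invented dates; `toy` is a hypothetical object, not part of the workshop data.

```r
library(dplyr)

# Toy data frame with a Date column (values chosen for illustration)
toy <- data.frame(date = as.Date(c("1990-01-15", "1990-02-01", "1990-12-31")))

# Inside mutate(), columns are referenced by name without the `toy$` prefix
toy <- toy %>%
  mutate(month = format(date, format = "%m"),
         julianD = format(date, format = "%j"))
toy

# format() returns character strings, so convert when numbers are needed
as.numeric(toy$julianD)
```

Because month and julianD are stored as text, later comparisons such as month > '08' are string comparisons; they behave as expected here only because the values are zero-padded.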
Daymet provides daily maximum and minimum temperatures but no daily mean, so calculate the mean temperature as a weighted average of tmax and tmin:
Yale.Myers$tmean <- (0.606 * as.numeric(Yale.Myers$tmax..deg.c.)) + (0.394 * as.numeric(Yale.Myers$tmin..deg.c.))
Look at your work:
summary(Yale.Myers$tmean)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-18.411 2.316 10.573 10.176 18.877 30.297
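As a quick check of the weighted average, the calculation can be done by hand for the first row of the table shown earlier (tmax = 5.15, tmin = -2.31). The simple midpoint is included only for comparison; it is not the method used in this workshop.

```r
# Values from the first row of the Daymet table (1 January 1990)
tmax <- 5.15
tmin <- -2.31

# Weighted average with the weights used in this workshop
tmean <- 0.606 * tmax + 0.394 * tmin
tmean   # 2.21076, matching the tmean column for that day

# For comparison only: the simple midpoint of tmax and tmin
midpoint <- (tmax + tmin) / 2
midpoint
```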
rename()
Use the function rename() to rename tmax..deg.c. to tmax, tmin..deg.c. to tmin, and prcp..mm.day. to prcp:
Yale.Myers.rn <- Yale.Myers %>% rename(tmax = tmax..deg.c., tmin = tmin..deg.c., prcp = prcp..mm.day.)
Look at your work:
head(Yale.Myers.rn )
year yday dayl..s. prcp srad..W.m.2. swe..kg.m.2. tmax tmin vp..Pa.
1 1990 1 32422.37 0.00 179.64 15.20 5.15 -2.31 515.41
2 1990 2 32467.03 0.00 200.25 14.94 3.11 -5.39 408.97
3 1990 3 32515.34 0.00 242.26 14.37 7.38 -4.64 432.86
4 1990 4 32567.27 2.22 184.50 13.49 8.79 -3.91 457.55
5 1990 5 32622.77 0.00 194.86 12.06 5.58 -2.59 504.75
6 1990 6 32681.80 0.00 218.59 11.97 3.72 -5.77 397.13
year.day date month julianD tmean
1 1990-01-01 1990-01-01 01 001 2.21076
2 1990-01-02 1990-01-02 01 002 -0.23900
3 1990-01-03 1990-01-03 01 003 2.64412
4 1990-01-04 1990-01-04 01 004 3.78620
5 1990-01-05 1990-01-05 01 005 2.36102
6 1990-01-06 1990-01-06 01 006 -0.01906
subset()
Subset the dataset to include only [date, month, year, julianD, tmax, tmin, tmean, and prcp] :
Yale.Myers.sub <- Yale.Myers.rn %>% select(date, month, year, julianD, tmax, tmin, tmean, prcp)
Look at your work:
head(Yale.Myers.sub )
date month year julianD tmax tmin tmean prcp
1 1990-01-01 01 1990 001 5.15 -2.31 2.21076 0.00
2 1990-01-02 01 1990 002 3.11 -5.39 -0.23900 0.00
3 1990-01-03 01 1990 003 7.38 -4.64 2.64412 0.00
4 1990-01-04 01 1990 004 8.79 -3.91 3.78620 2.22
5 1990-01-05 01 1990 005 5.58 -2.59 2.36102 0.00
6 1990-01-06 01 1990 006 3.72 -5.77 -0.01906 0.00
reframe()
Create an annual summary of conditions using reframe():
Yale.Myers.annual <- Yale.Myers.sub %>% reframe(.by = year, tmax = max(tmax), tmin = min(tmin), tmean = mean(tmean), prcp = sum(prcp))
Look at your work:
head(Yale.Myers.annual)
year tmax tmin tmean prcp
1 1990 32.41 -19.73 10.941330 1419.35
2 1991 35.03 -19.81 10.719331 1492.94
3 1992 32.78 -19.33 8.836160 1390.22
4 1993 33.02 -21.59 9.429097 1251.88
5 1994 32.66 -24.70 9.720943 1306.84
6 1995 34.60 -21.21 9.689669 1263.50
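A minimal sketch of the same grouping on invented data makes it easier to check the result by eye. It assumes dplyr 1.1.0 or later, where reframe() and the `.by` argument are available; `toy` and `annual` are hypothetical names.

```r
library(dplyr)

# Toy daily records for two years (values invented for illustration)
toy <- data.frame(year = c(1990, 1990, 1991, 1991),
                  tmax = c(5, 10, 7, 12),
                  prcp = c(1, 2, 3, 4))

# One summary row per year; .by groups the data for this call only
annual <- toy %>%
  reframe(.by = year, tmax = max(tmax), prcp = sum(prcp))
annual
```

Each summary here returns a single value per group, so summarise() with the same arguments would likely give the same table; reframe() also allows summaries that return several rows per group.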
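Determine the monthly conditions across all years using reframe(). Dividing the summed precipitation by 29, the number of years from 1990 to 2018, gives the average monthly total:
Yale.Myers.monthly <- Yale.Myers.sub %>% reframe(.by = month, tmax = max(tmax), tmin = min(tmin), tmean = mean(tmean), prcp = sum(prcp)/29)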
Look at your work:
head(Yale.Myers.monthly)
month tmax tmin tmean prcp
1 01 18.95 -24.70 -2.740226 102.8438
2 02 20.75 -25.98 -1.620468 94.5469
3 03 28.78 -20.39 2.739755 127.2134
4 04 33.76 -9.78 9.184644 114.5007
5 05 33.55 -2.08 15.002559 108.8931
6 06 34.28 3.22 19.647075 116.0659
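The divisor 29 is the number of years in the record. One way to avoid hard-coding it, sketched here on invented data, is to count the distinct years inside each group with n_distinct(); `toy` and `monthly` are hypothetical names.

```r
library(dplyr)

# The study period covers 29 years
length(1990:2018)

# Toy monthly precipitation totals for two years (values invented)
toy <- data.frame(year = c(1990, 1990, 1991, 1991),
                  month = c("01", "02", "01", "02"),
                  prcp = c(100, 80, 120, 60))

# Average monthly total: sum over all years divided by the number of years
monthly <- toy %>%
  reframe(.by = month, prcp = sum(prcp) / n_distinct(year))
monthly
```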
Summarize the conditions using reframe() by both year and month:
Yale.Myers.yearmon <- Yale.Myers.sub %>% reframe(.by = c(month, year), tmax = max(tmax), tmin = min(tmin), tmean = mean(tmean), prcp = sum(prcp))
Look at your work:
head(Yale.Myers.yearmon )
month year tmax tmin tmean prcp
1 01 1990 15.63 -9.41 1.22571871 124.95
2 02 1990 16.24 -19.73 0.09135071 119.02
3 03 1990 27.23 -17.71 4.16228645 63.76
4 04 1990 32.41 -5.48 8.59659000 156.66
5 05 1990 24.44 1.07 12.77852903 209.07
6 06 1990 28.55 4.62 19.91868733 59.14
select()
Subset the mean annual temperature and year from the annual summary, storing tmean in a column named Yale.Myers:
Yale.Myers.annual.tmean <- Yale.Myers.annual %>% mutate(Yale.Myers = tmean) %>% select(year, Yale.Myers)
Look at your work:
head(Yale.Myers.annual.tmean)
year Yale.Myers
1 1990 10.941330
2 1991 10.719331
3 1992 8.836160
4 1993 9.429097
5 1994 9.720943
6 1995 9.689669
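Naming the temperature column after the site makes it easy to merge this table with a similar table from another location using full_join(). The sketch below uses invented values; `site_a`, `site_b`, and the column `Other.Site` are hypothetical.

```r
library(dplyr)

# Two toy annual tables that share a year column (values invented)
site_a <- data.frame(year = c(1990, 1991, 1992),
                     Yale.Myers = c(10.9, 10.7, 8.8))
site_b <- data.frame(year = c(1991, 1992, 1993),
                     Other.Site = c(15.2, 14.1, 14.8))

# full_join() keeps every year found in either table;
# years missing from one table are filled with NA
both <- full_join(site_a, site_b, by = "year")
both
```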
save()
Save the three objects, Yale.Myers, Yale.Myers.sub, and Yale.Myers.annual.tmean, in an .RDATA file. You will need this for the post-workshop assessment:
save(Yale.Myers, Yale.Myers.sub, Yale.Myers.annual.tmean, file = "data/products/DataBasics.RDATA")
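To confirm that saved objects can be restored, a round trip can be sketched with a temporary file. The objects `a` and `b` are hypothetical placeholders for the workshop data frames.

```r
# Placeholder objects standing in for the workshop data frames
a <- 1:3
b <- data.frame(x = c(1.5, 2.5))

# Save to a temporary file, remove the objects, then load them back
path <- tempfile(fileext = ".RDATA")
save(a, b, file = path)
rm(a, b)

# load() restores the objects and returns their names
loaded <- load(path)
loaded
```

Note that save() does not create missing folders, so data/products must exist first; dir.create("data/products", recursive = TRUE) is one way to create it.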
Introduction to ggplot2
ggplot2 is a plotting package in R. Graphics are built layer by layer by adding new elements. This approach allows for extensive flexibility and customization of plots.
Load the library:
library(ggplot2)
Building blocks of layers with the grammar of graphics:
- Data: The dataset itself
- Aesthetics: The data are mapped onto aesthetic attributes such as x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, and line type
- Geometrics: How the data are displayed, using point, line, histogram, bar, or boxplot
- Facets: Display subsets of the data using columns and rows
- Statistics: Binning, smoothing, descriptive, intermediate
- Coordinates: The space between data and display, using Cartesian, fixed, polar, or limits
- Themes: Non-data elements such as font size and background color
Data Layer:
We first need to define the source of the information to be visualized. The function ggplot() is used to generate the plots.
ggplot(data = Yale.Myers.sub) +
labs(title = "Climate of Yale Myers Forest")
Aesthetic Layer:
The data is used to map onto the aesthetics attributes such as x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, line type.
ggplot(data = Yale.Myers.sub, aes(x = date , y = tmean)) +
labs(title = "Climate of Yale Myers Forest")
Geometric layer:
The geometric layer controls the essential elements. Display the data using point, line, histogram, bar, or boxplot.
ggplot(data = Yale.Myers.sub, aes(x = date , y = tmean))+
geom_point() +
labs(title = "Climate of Yale Myers Forest",
x = "Date",
y = "Mean Air Temperature (Degrees Celsius)")
Geometric layer: Adding size, color, and shape.
# Adding size
ggplot(data = Yale.Myers.sub, aes(x = date , y = tmean, size=year))+
geom_point() +
labs(title = 'Climate of Yale Myers Forest',
x = "Date",
y = "Mean Air Temperature (Degrees Celsius)")
# Adding color
ggplot(data = Yale.Myers.sub, aes(x = date , y = tmean, col=month))+
geom_point() +
labs(title = 'Climate of Yale Myers Forest',
x = "Date",
y = "Mean Air Temperature (Degrees Celsius)")
# Histogram plot
ggplot(data = Yale.Myers.sub, aes(x = tmean)) +
geom_histogram(binwidth = 5) +
labs(title = 'Climate of Yale Myers Forest',
x = "Mean Air Temperature (Degrees Celsius)",
y = "Count" )
Facet Layer:
The facet layer is used to split the data into subsets of the entire dataset. It allows the subsets to be visualized on the same plot. Here we separate plots according to month.
# Subset dataset to include only months 9-11
Yale.Myers.fall <- Yale.Myers.sub %>% filter(month > '08' & month < '12')

p <- ggplot(data = Yale.Myers.fall, aes(x = date, y = tmean)) + geom_point()

p + facet_grid(month ~ .) +
  labs(title = 'Climate of Yale Myers Forest',
       x = "Date",
       y = "Mean Air Temperature (Degrees Celsius)")
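The filter above compares month as text, which works because the month values are zero-padded two-character strings. The sketch below, on invented data, shows two alternatives that express the same selection more explicitly; `toy` and the `fall_*` objects are hypothetical.

```r
library(dplyr)

# Toy table with zero-padded month strings (values invented)
toy <- data.frame(month = c("08", "09", "10", "11", "12"),
                  tmean = c(21, 17, 11, 5, 0))

# String comparison, as used in the workshop
fall_a <- toy %>% filter(month > "08" & month < "12")

# Alternative 1: list the wanted months explicitly
fall_b <- toy %>% filter(month %in% c("09", "10", "11"))

# Alternative 2: convert to numbers before comparing
fall_c <- toy %>% filter(as.numeric(month) >= 9, as.numeric(month) <= 11)

fall_a$month
```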
Statistics layer:
We can transform our data using binning, smoothing, descriptive, or intermediate statistics. We will explore how February temperatures changed from 1990 to 2018.
# Subset the data:
Yale.Myers.feb <- Yale.Myers.sub %>% filter(month == '02')

ggplot(data = Yale.Myers.feb, aes(x = date , y = tmean)) + geom_point() +
stat_smooth(method = lm, col = "red") +
labs(title = 'Climate of Yale Myers Forest')
`geom_smooth()` using formula = 'y ~ x'
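stat_smooth(method = lm) draws a fitted straight line but does not report its slope. A minimal sketch of inspecting the fit directly with lm() is shown here on invented data where the true slope is known; `x`, `y`, and `fit` are hypothetical names.

```r
# Toy data that lie exactly on the line y = 2x + 1
x <- 1:10
y <- 2 * x + 1

# Fit the same kind of model that stat_smooth(method = lm) draws
fit <- lm(y ~ x)

# Intercept and slope of the fitted line
coef(fit)
```

When x is a Date, as in the February plot, the slope is expressed per day because R stores dates as a number of days.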
Coordinates layer:
The coordinates layer maps the data onto the plane of the graphic. Use it to adjust the axes, change the spacing of the displayed data, and control the plot dimensions.
ggplot(data = Yale.Myers.fall, aes(x = tmin , y = tmean)) + geom_point() +
stat_smooth(method = lm, col = "red") +
labs(title = 'Climate of Yale Myers Forest') +
scale_y_continuous(limits = c(-10, 40), expand = c(0, 0)) +
scale_x_continuous(limits = c(-10, 40), expand = c(0, 0)) +
coord_equal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 13 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).
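Use coord_cartesian() to zoom in properly. Unlike scale limits, it does not remove the points that fall outside the range:
# Add coord_cartesian() to zoom in without dropping data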
ggplot(data = Yale.Myers.fall, aes(x = tmin , y = tmean)) + geom_point() +
stat_smooth(method = lm, col = "red") +
labs(title = 'Climate of Yale Myers Forest') +
coord_cartesian(xlim = c(-10, 10))
`geom_smooth()` using formula = 'y ~ x'
Theme Layer:
The theme layer controls the finer points of display, like the font size and background color properties.
Example: the element_rect() function in the theme layer:
ggplot(data = Yale.Myers.fall, aes(x = tmin , y = tmean)) + geom_point() +
stat_smooth(method = lm, col = "red") +
facet_grid(. ~ month) +
theme(plot.background = element_rect(fill = "lightblue", colour = "gray"))
`geom_smooth()` using formula = 'y ~ x'
Introduction to Writing Functions in R
Functions are essential tools in R. They eliminate repetition from your code, which reduces your workload and helps you avoid errors. It is important to understand the purpose and syntax of R functions and to know how to create and use them. In this tutorial, we’ll learn what an R function is, what types of functions exist in R, when to use a function, and how to create and call a user-defined function.
What Is a function in R?
In R, a function is a set of instructions that performs a specific task. Functions are essential for code organization, reusability, and efficiency. R comes with a variety of built-in functions, and you can also create your own custom functions. The main purpose of creating a user-defined function is to optimize scripts and avoid repetition. A good practice is creating a function whenever you need to run a certain set of commands more than twice.
Built-in Functions in R
Some of the most popular built-in functions are:
- min(), max(), mean(), median(): return the minimum / maximum / mean / median value of a numeric vector
- sum(): returns the sum of a numeric vector
- range(): returns the minimum and maximum values of a numeric vector
- abs(): returns the absolute value of a number
- str(): shows the structure of an R object
- print(): displays an R object on the console
- ncol(): returns the number of columns of a matrix or a dataframe
- length(): returns the number of items in an R object (a vector, a list, etc.)
- nchar(): returns the number of characters in a character object
- sort(): sorts a vector in ascending or descending (decreasing=TRUE) order
- exists(): returns TRUE or FALSE depending on whether or not a variable is defined in the R environment
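A few of these built-in functions can be tried on a small vector. The values and the object name `v` are chosen for illustration only.

```r
v <- c(3, 1, 2)

range(v)                    # minimum and maximum: 1 3
sort(v, decreasing = TRUE)  # 3 2 1
length(v)                   # 3
abs(-4.5)                   # 4.5
nchar("forest")             # 6
exists("v")                 # TRUE, because v was defined above
```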
Creating a Function in R
To create a user-defined function in R, we use the keyword function. The syntax is as follows:
function_name <- function(parameters){ function body }
1. Function Name
This is the name of the function object that will be stored in the R environment. After the function is defined, the function name is used for calling that function. It should be concise but clear and meaningful. It is common to use verbs in function names and it is also ok to use a noun if that noun is very descriptive and unambiguous.
2. Function Parameters
Function parameters are the variables in the function definition placed inside the parentheses and separated with a comma that will be set to actual values (called arguments) each time we call the function.
3. Function Body
The function body is a set of commands inside the curly braces that are run in a predefined order every time we call the function. In the function body, we place what exactly we need the function to do. For example, we can create a function to sum two numbers listed as parameters x and y:
sum_two_nums <- function(x, y){
  x + y
}
print(sum_two_nums(1, 2))
[1] 3
Next, let’s create a function to calculate the circumference of a circle with a known radius. The function has only one parameter, r.
circumference <- function(r){
  2*pi*r
}
After defining the function, call it with the radius equal to 2:
circumference(r = 2)
[1] 12.56637
It’s possible for a function to have no parameters:
hello_world <- function(){
  'Hello, World!'
}
print(hello_world())
[1] "Hello, World!"
Also, some parameters can be set to default values inside the function definition, which then can be reset when calling the function. In the circumference function, we can set the default radius of a circle to 1. If we call the function with no argument passed, it will calculate the circumference of a circle with a radius of 1. Otherwise, it will calculate the circumference of a circle with the provided radius:
circumference <- function(r = 1){
  2*pi*r
}
print(circumference())
[1] 6.283185
print(circumference( r=2))
[1] 12.56637
Note that the statements in the function body should be indented by 2 or 4 spaces, depending on the IDE where we run the code. The important thing is to be consistent with the indentation throughout the program. While it doesn’t affect the code performance, it makes the code easier to read.
return()
As we saw from all the above examples, in R, it usually isn’t necessary to explicitly include the return statement when defining a function since an R function just automatically returns the last evaluated expression in the function body. However, we still can add the return statement inside the function body using the syntax return(expression_to_be_returned). This becomes important if we need to return more than one result from a function. For example:
mean_median <- function(vector){
  mean <- mean(vector)
  median <- median(vector)
  return(c(mean, median))
}
print(mean_median(c(1, 1, 1, 2, 3)))
[1] 1.6 1.0
Note that in the return statement above, we actually return a vector containing the necessary results, and not just the variables separated by a comma. The return() function can return only a single R object. Instead of a vector, we could also return a list, especially if the results to be returned are supposed to be of different data types.
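As a sketch of the list alternative, the function below returns two numbers and a text label in one object. The name `mean_median_list` and the `label` element are hypothetical additions for illustration.

```r
# Return results of different types together in a named list
mean_median_list <- function(vector){
  return(list(mean = mean(vector),
              median = median(vector),
              label = "summary of input vector"))
}

res <- mean_median_list(c(1, 1, 1, 2, 3))

# Elements are retrieved by name with $
res$mean
res$median
res$label
```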
Calling a Function in R
In all the above examples, we called the created functions by using the function name and adding the necessary arguments inside the parentheses. In R, function arguments can be passed by position, by name (so-called named arguments), by mixing position-based and name-based matching, or by omitting the arguments altogether.
If we pass the arguments by position, we need to follow the same sequence of arguments as defined in the function:
subtract_two_nums <- function(x, y){
  x - y
}
print(subtract_two_nums(3, 1))
[1] 2
In the above example, x is equal to 3 and y to 1, and not vice versa.
If we pass the arguments by name, i.e., explicitly specify what value each parameter defined in the function takes, the order of the arguments doesn’t matter:
subtract_two_nums <- function(x, y){
  x - y
}
print(subtract_two_nums(x=3, y=1))
[1] 2
print(subtract_two_nums(y=1, x=3))
[1] 2
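Position-based and name-based matching can also be mixed. In this short sketch, R matches the named argument first and then fills the remaining parameters with the unnamed values in order.

```r
subtract_two_nums <- function(x, y){
  x - y
}

# y is matched by name, so the unnamed value goes to x
subtract_two_nums(3, y = 1)   # 3 - 1 = 2
subtract_two_nums(y = 1, 3)   # still 3 - 1 = 2

# Naming y = 3 leaves the unnamed 1 for x
subtract_two_nums(y = 3, 1)   # 1 - 3 = -2
```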
Things to remember when using functions inside other functions:
- Functions can have default values for arguments.
- Functions can return multiple values using a list or other data structures.
- You can use the return() statement to specify the value to be returned by the function.
- If you want to be able to use the function independent of another function, it should be created outside a function instead of nesting the functions.
Post-Workshop Assessment:
The standard error measures the precision of the estimate of the sample mean. Write a function to calculate the standard error. Use this function when comparing mean monthly temperature values between Yale-Myers and a chosen location.
Find a location of interest within the USA in Google Earth (https://earth.google.com/) and record the latitude and longitude in decimal degrees. Next, download Daymet data for that location. Subset tmean in both datasets, then calculate the monthly mean and standard error (SE) of tmean for both locations. Create a figure comparing the monthly mean and standard error of tmean at both locations.
Create figures and text to show the following:
- Differences in monthly tmean, and standard error in tmean for both locations.
- Differences in monthly precipitation (prcp), and standard error in prcp for both locations.
- Differences in monthly minimum temperature (tmin), and standard error in tmin for both locations.
- Differences in monthly maximum temperature (tmax), and standard error in tmax for both locations.
- Why are the two locations similar or different?
- Create a report to upload to Canvas (save as a PDF). Consider creating this report in R Markdown: https://hbctraining.github.io/Training-modules/Rmarkdown/lesson.html