R class - packages and functions

Karen Cristine Goncalves, Ph.D.

2023-09-01

Be lazy!

The most certain way to make mistakes when programming is typing everything.

Think of when you are writing a text, how many typos do you make? This will happen when you are coding!

Realizing there are typos or bigger mistakes in your code is more difficult than in a text to a friend.

So, be lazy and copy/paste codes when possible.

Also, don’t try to re-invent the wheel. Check on stack-overflow, bioconductor or github to see if other people already tried what you are doing and the solutions they came up with. You can always edit their code!

Packages

These are simply folders containing files with many functions.

They may contain data so you can test the functions.

How to use packages

If you never used the package, it probably is not installed in your computer, so you need to use the function “install.packages”

install.packages("vegan") # a package with functions for multivariate stats functions

# You can install many packages at once
install.packages(c("venn", "tidyverse"))

install.packages just installs the package, but it does not load it (open in R). To load, use the function library

library(cluster)
library("tidyverse") # you do not need the quotes with the function library

You cannot put multiple package names in the library function You need to load them one by one

Come back to this part once you are more comfortable with programming.

You cannot put a vector (ie. c("a", "b", "c")) inside library(), but you can write a code that loads the packages one by one without you writing that manually.

# start by creating a vector with all the packages you need
pkgs = c("rmarkdown", "tidyverse", "venn")

# We check which packages are NOT (!) installed
pkgs.To.Install = ! pkgs %in% installed.packages()

# any() checks if there is at least one TRUE in the vector
if (any(pkgs.To.Install)) install.packages(pkgs[pkgs.To.Install])

for (curPkg in pkgs) library(curPkg, character.only = T) 
# curPkg is a variable that takes the value of each element in pkgs
# Every time the function library() is run, curPkg changes value

loops

Let’s create a function and a vector of elements that we will use in the function

thesisDefense = function(x) {return(paste("Ph.D.", x))}
candidates = c("Serge", "Fadoua", "Arghavan", 
           "Snehi", "Basanta", "Sajjad", 
           "Mahsa", "Archana")
  • for
    • Function that will repeat what you ask for everything in the group you give it.
      Eg: every PhD candidate has to defend their theses: - Useful when you do not want to save the results or when you need to save separated objects for each elements in the loop (each candidate below)
for (phd in candidates) { thesisDefense(phd) }
# To see an output, we would have to either print it or create an object to save it
defenseResults = c()
for (phd in candidates) { defenseResults[phd] = thesisDefense(phd) }
defenseResults
           Serge           Fadoua         Arghavan            Snehi 
   "Ph.D. Serge"   "Ph.D. Fadoua" "Ph.D. Arghavan"    "Ph.D. Snehi" 
         Basanta           Sajjad            Mahsa          Archana 
 "Ph.D. Basanta"   "Ph.D. Sajjad"    "Ph.D. Mahsa"  "Ph.D. Archana" 
  • apply functions
    • Similar to for, but generates an output
    • Useful when you want a single output for each element of the loop
    • Several options: apply, sapply, lapply, mapply, vapply
defenseResults = sapply(candidates, \(phd) thesisDefense(phd) )

defenseResults
           Serge           Fadoua         Arghavan            Snehi 
   "Ph.D. Serge"   "Ph.D. Fadoua" "Ph.D. Arghavan"    "Ph.D. Snehi" 
         Basanta           Sajjad            Mahsa          Archana 
 "Ph.D. Basanta"   "Ph.D. Sajjad"    "Ph.D. Mahsa"  "Ph.D. Archana" 

Tidyverse - many packages that make your life easier

When you load the package “tidyverse”, it tells you that it loaded many others (packages from the universe tidy)

  1. Tables and data.frames
    • dplyr: many functions to work with tables
    • tibble: construct data frames (tables)
    • tidyr: tools to better organize data (make it tidyier)
    • readr: fast and friendly functions to read table data
  2. Text-like data
    • forcats: (anagram of factors) tools to work with factors (tutorial)
    • stringr: work with text (character variables)
    • lubridate: work with dates
  3. Plots
    • ggplot2: create better plots
  4. Programming

Pipe (|> or %>%)

Think of the pipe as a tube connecting two functions. When the first function is done, instead of presenting its output, you use it to start a new function.

  • Eg. To start a postdoc, I had to finish my Ph.D: phd("Karen") %>% postdoc

If you use |>, you need to put the parenthesis in the function in front of it. Meaning:

  • phd("Karen") %>% postdoc works same as phd("Karen") %>% postdoc() and phd("Karen") |> postdoc()
  • phd("Karen") |> postdoc does not work

Let’s say you just want to check if R will read your file right or if you need to add more things to it before really working on the file.

Run the function str on your input file or on the example from Class 0

You can load files from the internet without needing to download them to your computer!

# You could first put the path to the file in a variable, it would make things easier
myFile = "https://karengoncalves.github.io/Programming_classes/exampleData/Class1_exampleData.txt"

str(read.delim(myFile))
'data.frame':   8 obs. of  7 variables:
 $ X        : chr  "A" "B" "C" "D" ...
 $ Control_1: num  0.1229 0.0907 0.1068 0.0161 0.1533 ...
 $ Control_2: num  0.827 0.11 0.469 0.359 0.462 ...
 $ Control_3: num  0.1486 0.3352 0.2419 0.0933 0.9419 ...
 $ Treated_1: num  0.086 0.1403 0.1132 0.0271 0.3284 ...
 $ Treated_2: num  0.3145 0.2441 0.2793 0.0352 0.0909 ...
 $ Treated_3: num  0.22 0.433 0.326 0.107 0.316 ...

Using pipe, you can see the progression of the pipeline, eg.:
bachelor("Karen") %>% phd() %>% postdoc
instead of
postdoc(phd(bachelor("Karen")))

Basically, the first function written is the first one used, not the last.

# You can also use a pipe ( %>% )
myFile = "https://karengoncalves.github.io/Programming_classes/exampleData/Class1_exampleData.txt"

read.delim(myFile) %>% str

readxl package - open excel files

Part of the tidyverse package

Allows you to:

  • Check the names of the sheets of an excel file: excel_sheets(MY_EXCEL_FILE)
  • Read sheets of an excel file as tibbles (a type of data.frame):
    • read_excel(file = MY_EXCEL_FILE, sheet = MY_SHEET)
    • You can specify what defines NA: read_excel(file = MY_EXCEL_FILE, sheet = MY_SHEET, na = ".")
    • Instead of header = T, you use col_names = T
    • You can specify the number of rows to skip in the beginning with skip = N
    • You can specify the number of rows to read with n_max = N
    • You can specify which cols and rows to read with range = "H40:T80"
      • If range is specified, skip and n_max are ignored
      • You can include the name of the sheet: range = "MY_SHEET!H40:T80"
        • SHEET EXCLAMATION_POINT CELL_RANGE

ggplot2

Create a colored boxplot with the dataset iris, with nice axes titles

Code from the previous class

colors = c("red", "green", "blue")
boxplot(Sepal.Length ~ Species, 
    data = iris, col = colors, 
    ylab = "Sepal length (mm)", xlab = "Species epithet")

Use ggplot. The syntax may take a while to get used to, but it is easier to read.

# aes is short for aesthetics, basically what columns hold the values you want to plot
# fill is the color that will be inside the box, color is just for the border
ggplot(data = iris,
       aes(x = Species,
           y = Sepal.Length,
           fill = Species)) +
    geom_boxplot() + # the type of plot you want
    ylab("Sepal length (mm)") +
    xlab("Species epithet")

Improve ggplot by setting a better theme (check here see the options)

# Set a theme for all future plots in this session
theme_set(theme_bw()) 

# Use different colors to fill and remove the legend
colors = c("red", "green", "blue")

ggplot(data = iris,
       aes(x = Species,
           y = Sepal.Length,
           fill = Species)) +
    geom_boxplot(show.legend = F) + # the type of plot you want
    ylab("Sepal length (mm)") +
    xlab("Species epithet") +
    scale_fill_manual(values = colors)

ggplot - part 2

Graphics are layers of data and images put on top of each other. That is why the pieces of the ggplot function are connected by a +.

The codes below construct the plot from the previous slide piece by piece.

# Set a theme for all future plots in this session
theme_set(theme_bw()) 

# Use different colors to fill and remove the legend
(plot1 = ggplot(data = iris,
       aes(x = Species,
           y = Sepal.Length,
           fill = Species)))

(plot2 = plot1 +
    geom_boxplot(show.legend = F)) # the type of plot you want

labs adds labels:

  • x and y - will add the labels to the axes (you can use the functions xlab or ylab instead)
  • title - Normally at the top of the figure (you can use the function ggtitle instead)
  • subtitle - Goes under the title and is a bit smaller
  • caption - Normally goes at the bottom of the figure and is a small text
  • tag - the letter that goes at the top-left corner of a plot that is one part of several in a figure
(plot3 = plot2 +
    labs(y = "Sepal length (mm)", x = "Species epithet"))

plot3

colors = c("red", "green", "blue")

plot3 + scale_fill_manual(values = colors)

dplyr: prepare your data for ggplot

In the example dataset, the names of the groups are in the middle not specified, they are inside the replicate name. We need one column with the values and one with the names of the treatments.

myFile = "https://karengoncalves.github.io/Programming_classes/exampleData/Class1_exampleData.txt"
rawData = read.delim(myFile)
names(rawData)
[1] "X"         "Control_1" "Control_2" "Control_3" "Treated_1" "Treated_2"
[7] "Treated_3"
# Let's change x to "Measured"
names(rawData)[1] = "Measured"

The table now is in the format “wide”, the one we want is called “long”

longData = pivot_longer(
    data = rawData,
    cols = !Measured, # gets all the columns of the table, except for the one in front of !
    names_to = "Replicates", # name of the column that will contain column names from rawData
    values_to = "Measurements"
    )

str(longData)
tibble [48 × 3] (S3: tbl_df/tbl/data.frame)
 $ Measured    : chr [1:48] "A" "A" "A" "A" ...
 $ Replicates  : chr [1:48] "Control_1" "Control_2" "Control_3" "Treated_1" ...
 $ Measurements: num [1:48] 0.123 0.827 0.149 0.086 0.314 ...

Let’s split the values from “Replicates” using str_split from stringr

mutate will return the input table with the new column we create

# pattern is what separates (_)
# i is the part that we want to see: Control_1 has 2 pieces, i=1 returns "Control"

longDataTreatments = longData %>%
    mutate(Treatment = str_split_i(Replicates, pattern = "_", i = 1))

str(longDataTreatments)
tibble [48 × 4] (S3: tbl_df/tbl/data.frame)
 $ Measured    : chr [1:48] "A" "A" "A" "A" ...
 $ Replicates  : chr [1:48] "Control_1" "Control_2" "Control_3" "Treated_1" ...
 $ Measurements: num [1:48] 0.123 0.827 0.149 0.086 0.314 ...
 $ Treatment   : chr [1:48] "Control" "Control" "Control" "Treated" ...
longDataTreatments %>%
    ggplot(aes(x = Treatment, y = Measurements, fill = Measured)) +
    geom_boxplot()

longDataTreatments %>%
    ggplot(aes(x = Measured, y = Measurements, fill = Treatment)) +
    geom_boxplot()

Types of plots - sample distribution

  1. Boxplot - check this slide
    • geom_boxplot
    • Allows the visual comparison of groups, like a bar chart, as well as the distribution of the replicates
      • You can see the minimum, maximum and the median of the sample, as well as the outliers
  2. Violin plot
    • geom_violin
    • Same as boxplot, but without the quantiles drawn

You can put different types of plot and multiple data in the same graphic.

Over the violin or boxplot layer, you can add the points representing the value measured for each replicate, the mean, etc.

If you have few replicates (< 5), use a barplot with the standard deviation (make sure the color of the bar allows the visualization of the error bar)

If many replicates per group, use a boxplot or violin plot.

Example violin plot

The code to get a violin plot is the same as the one for a boxplot, the only difference is the geom_violin.

# Set a theme for all future plots in this session
theme_set(theme_bw()) 

# Use different colors to fill and remove the legend
colors = c("red", "green", "blue")

ggplot(data = iris,
       aes(x = Species,
           y = Sepal.Length,
           fill = Species)) +
    geom_violin(show.legend = F) + # violin instead of boxplot
    ylab("Sepal length (mm)") +
    xlab("Species epithet") +
    scale_fill_manual(values = colors)

Types of plots - Bar chart

  • Uses
    • Visually compare means when the number of replicates is low
    • Visually compare counts (number of occurences) - number of students per lab
  • How
    • Use either geom_col or geom_bar(stat = "identity")
    • Add standard deviation bar with geom_errorbar
      • geom_errorbar(aes(ymin=Mean-sd, ymax=Mean+sd), width=.2)
df <- data.frame(
  group = c("Male", "Female", "Child"),
  value = c(25, 25, 50)
  )

# Barplot like geom_col
ggplot(df, aes(x = group, y = value, fill = group)) +
    geom_col(width = 1)

  • identity means that the size of the bar will be the value you put instead of calculating something
df <- data.frame(
  group = c("Male", "Female", "Child"),
  value = c(25, 25, 50)
  )

# Barplot like geom_col
ggplot(df, aes(x = group, y = value, fill = group)) +
    geom_bar(width = 1,  stat = "identity")

  • Using either geom_bar or geom_col, you can make a pie chart with + coord_polar("y")
df <- data.frame(
  group = c("Male", "Female", "Child"),
  value = c(25, 25, 50)
  )

# Barplot like geom_col
ggplot(df, aes(x = "", y = value, fill = group)) +
    # x has to be "" - check how it looks if you put x = group
    geom_bar(width = 1,  stat = "identity") +
    coord_polar("y", start = 1) +
    theme_void() # clean up grids, borders, and labels

Types of plot - proportions

You can use geom_bar(stat = "count") if you do not have the number of occurrences of each category computed. This way, ggplot counts it and plots. - Eg. A table with metabolites detected separated by category: how many metabolites of each category were detected?

If you want proportions in %, use stat = "density" instead of stat = "count"

  • With geom_bar, you can have the same plot as with geom_col if you use stat = "identity"
  • identity means that the size of the bar will be the value you put instead of calculating something
# Datasets about US states
US_statesInfo = data.frame(Name = state.name,
               Region = state.region,
               Division = state.division)
x = "Number of states"
# Plot the number of states in each division
ggplot(US_statesInfo, aes(y = Division)) +
    geom_bar(stat = "count") +
    xlab(x)

# Plot the number of states in each region
ggplot(US_statesInfo, aes(y = Region)) +
    geom_bar() +
    xlab(x)

# Plot the number of states in each division, and color by the region
ggplot(US_statesInfo, aes(y = Division, fill = Region)) +
    geom_bar(stat = "count") +
    xlab(x)

Types of plots - part 3

Line graph - geom_line

  • Visualize data across time - points are connected because they are the same sample at different times.

Let’s say we want to see how the labs of 2 PIs have grown across the years and compare the two.

# Let's create a time series to plot

LabSize = data.frame(Isabel = c(0, 3, 5, 10, 30),
             Hugo = c(2, 2, 6, 9, 9),
             Year = seq(2014, 2022, 2))
LabSize.Long = pivot_longer(LabSize,
                cols = !Year, # all columns from LabSize, except "Year"
                names_to = "PI", 
                values_to = "LabMembers")

ggplot(LabSize.Long, 
       aes(x = Year, y = LabMembers, color = PI)) +
    geom_line()

Advanced line graph

We will use the time series data

# Load the data
timeSeries.File = "https://karengoncalves.github.io/Programming_classes/exampleData/TimeSeries_example.csv"
timeSeries = read.csv(timeSeries.File)
str(timeSeries)
'data.frame':   12 obs. of  10 variables:
 $ X           : chr  "Day_1" "Day_2" "Day_3" "Day_4" ...
 $ Control_3   : num  0 1.96 3.98 6.18 7.52 ...
 $ Control_2   : num  0 2.04 4.02 6 8 ...
 $ Control_1   : num  0 2 4 5.82 8.48 ...
 $ TreatmentA_1: num  0 2.94 6.03 8.73 12 ...
 $ TreatmentA_2: num  0 3 6 9.27 11.28 ...
 $ TreatmentA_3: num  0 3.06 5.97 9 12.72 ...
 $ TreatmentB_3: num  0 1.53 3 4.5 6.36 ...
 $ TreatmentB_2: num  0 1.47 2.98 4.37 5.64 ...
 $ TreatmentB_1: num  0 1.5 3.02 4.63 6 ...
# Let's rename the first column that indicate the time points
names(timeSeries)[1] = "TimePoint" 
timeSeries$TimePoint = 
    gsub("Day_", "", timeSeries$TimePoint) |>
    as.numeric()

Let’s transform the data to format long and add a column with the name of the treatment group

library(tidyverse)

timeSeriesLong = pivot_longer(
    timeSeries,
    cols = !TimePoint, 
    names_to = c("Treatment", "Replicates"),
    names_sep = "_",
    values_to = "Growth_measure"
)

head(timeSeriesLong)    
# A tibble: 6 × 4
  TimePoint Treatment  Replicates Growth_measure
      <dbl> <chr>      <chr>               <dbl>
1         1 Control    3                       0
2         1 Control    2                       0
3         1 Control    1                       0
4         1 TreatmentA 1                       0
5         1 TreatmentA 2                       0
6         1 TreatmentA 3                       0
ggplot(timeSeriesLong) +
    geom_line(aes(x = TimePoint, y = Growth_measure, 
              group = Treatment, 
              color = Treatment)) +
    scale_x_continuous(breaks = 1:12)

Now let’s create another table with the mean, standard deviation and coefficient of variation of each treatment per time point

GrowthMeasureStats = timeSeriesLong %>%
    group_by(Treatment, TimePoint) %>%
    summarise(
        Mean = mean(Growth_measure),
        StdEnv = sd(Growth_measure)
    )

For the error bars, we cannot simply say “here, this is the standard deviation”.

We need to tell it “use this value as maximum and this as minimum”

See that we set “ymin” and “ymax”. If you have standard deviation for the x-axis, you can add it with “xmin” and “xmax”.

# We put inside ggplot() what is common to all layers
p1 = ggplot(GrowthMeasureStats, 
       aes(x = TimePoint, 
           color = Treatment)) +
    geom_line(aes(y = Mean)) 
p1 

p1 + geom_errorbar(aes(ymin = Mean - StdEnv,
               ymax = Mean + StdEnv),
           width = 0.2) +
    scale_x_continuous(breaks = 1:12)

Test if the curves are different

# Create the linear model of Growth_measure ~ TimePoint, adding the Treatment interaction
linear.model = lm(Growth_measure ~ TimePoint*Treatment, data = timeSeriesLong)
summary(linear.model)

Call:
lm(formula = Growth_measure ~ TimePoint * Treatment, data = timeSeriesLong)

Residuals:
   Min     1Q Median     3Q    Max 
 -1.80  -0.06   0.00   0.06   1.80 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -2.00000    0.14189 -14.095  < 2e-16 ***
TimePoint                      2.00000    0.01928 103.740  < 2e-16 ***
TreatmentTreatmentA           -1.00000    0.20066  -4.984 2.56e-06 ***
TreatmentTreatmentB            0.50000    0.20066   2.492   0.0143 *  
TimePoint:TreatmentTreatmentA  1.00000    0.02726  36.678  < 2e-16 ***
TimePoint:TreatmentTreatmentB -0.50000    0.02726 -18.339  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3993 on 102 degrees of freedom
Multiple R-squared:  0.9979,    Adjusted R-squared:  0.9978 
F-statistic:  9800 on 5 and 102 DF,  p-value: < 2.2e-16
install.packages("lsmeans"); library("lsmeans")
package 'lsmeans' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\Karen\AppData\Local\Temp\RtmpEjEd3Q\downloaded_packages
linear.model.trends <- lstrends(linear.model, "Treatment", var="TimePoint")
summary(linear.model.trends)
 Treatment  TimePoint.trend     SE  df lower.CL upper.CL
 Control                2.0 0.0193 102     1.96     2.04
 TreatmentA             3.0 0.0193 102     2.96     3.04
 TreatmentB             1.5 0.0193 102     1.46     1.54

Confidence level used: 0.95 
pairs(linear.model.trends)
 contrast                estimate     SE  df t.ratio p.value
 Control - TreatmentA        -1.0 0.0273 102 -36.678  <.0001
 Control - TreatmentB         0.5 0.0273 102  18.339  <.0001
 TreatmentA - TreatmentB      1.5 0.0273 102  55.016  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates 

References

Code from class

Tuesdays group

  1. September 19th
  2. September 19th (2)
  3. October 3rd
  4. October 10th
  5. October 17th

Fridays group

  1. September 15th
  2. September 29th
  3. October 6th
  4. October 19th