2023-09-01
The most certain way to make mistakes when programming is typing everything.
Think of when you are writing a text, how many typos do you make? This will happen when you are coding!
Realizing there are typos or bigger mistakes in your code is more difficult than in a text to a friend.
So, be lazy and copy/paste codes when possible.
Also, don’t try to re-invent the wheel. Check on stack-overflow, bioconductor or github to see if other people already tried what you are doing and the solutions they came up with. You can always edit their code!
These are simply folders containing files with many functions.
They may contain data so you can test the functions.
If you never used the package, it probably is not installed in your computer, so you need to use the function “install.packages”
install.packages
just installs the package, but it does not load it (open in R). To load, use the function library
You cannot put multiple package names in the library function You need to load them one by one
Come back to this part once you are more comfortable with programming.
You cannot put a vector (ie. c("a", "b", "c")
) inside library()
, but you can write a code that loads the packages one by one without you writing that manually.
# start by creating a vector with all the packages you need
pkgs = c("rmarkdown", "tidyverse", "venn")
# We check which packages are NOT (!) installed
pkgs.To.Install = ! pkgs %in% installed.packages()
# any() checks if there is at least one TRUE in the vector
if (any(pkgs.To.Install)) install.packages(pkgs[pkgs.To.Install])
for (curPkg in pkgs) library(curPkg, character.only = T)
# curPkg is a variable that takes the value of each element in pkgs
# Every time the function library() is run, curPkg changes value
Let’s create a function and a vector of elements that we will use in the function
for
# To see an output, we would have to either print it or create an object to save it
defenseResults = c()
for (phd in candidates) { defenseResults[phd] = thesisDefense(phd) }
defenseResults
Serge Fadoua Arghavan Snehi
"Ph.D. Serge" "Ph.D. Fadoua" "Ph.D. Arghavan" "Ph.D. Snehi"
Basanta Sajjad Mahsa Archana
"Ph.D. Basanta" "Ph.D. Sajjad" "Ph.D. Mahsa" "Ph.D. Archana"
apply
functions
for
, but generates an outputapply
, sapply
, lapply
, mapply
, vapply
Serge Fadoua Arghavan Snehi
"Ph.D. Serge" "Ph.D. Fadoua" "Ph.D. Arghavan" "Ph.D. Snehi"
Basanta Sajjad Mahsa Archana
"Ph.D. Basanta" "Ph.D. Sajjad" "Ph.D. Mahsa" "Ph.D. Archana"
When you load the package “tidyverse”, it tells you that it loaded many others (packages from the universe tidy)
tidy
ier)factors
) tools to work with factors (tutorial)|>
or %>%
)Think of the pipe as a tube connecting two functions. When the first function is done, instead of presenting its output, you use it to start a new function.
phd("Karen") %>% postdoc
If you use |>
, you need to put the parenthesis in the function in front of it. Meaning:
phd("Karen") %>% postdoc
works same as phd("Karen") %>% postdoc()
and phd("Karen") |> postdoc()
phd("Karen") |> postdoc
does not workLet’s say you just want to check if R will read your file right or if you need to add more things to it before really working on the file.
Run the function str
on your input file or on the example from Class 0
You can load files from the internet without needing to download them to your computer!
# You could first put the path to the file in a variable, it would make things easier
myFile = "https://karengoncalves.github.io/Programming_classes/exampleData/Class1_exampleData.txt"
str(read.delim(myFile))
'data.frame': 8 obs. of 7 variables:
$ X : chr "A" "B" "C" "D" ...
$ Control_1: num 0.1229 0.0907 0.1068 0.0161 0.1533 ...
$ Control_2: num 0.827 0.11 0.469 0.359 0.462 ...
$ Control_3: num 0.1486 0.3352 0.2419 0.0933 0.9419 ...
$ Treated_1: num 0.086 0.1403 0.1132 0.0271 0.3284 ...
$ Treated_2: num 0.3145 0.2441 0.2793 0.0352 0.0909 ...
$ Treated_3: num 0.22 0.433 0.326 0.107 0.316 ...
Using pipe, you can see the progression of the pipeline, eg.: bachelor("Karen") %>% phd() %>% postdoc
instead of postdoc(phd(bachelor("Karen")))
Basically, the first function written is the first one used, not the last.
readxl
package - open excel filesPart of the tidyverse
package
Allows you to:
excel_sheets(MY_EXCEL_FILE)
read_excel(file = MY_EXCEL_FILE, sheet = MY_SHEET)
read_excel(file = MY_EXCEL_FILE, sheet = MY_SHEET, na = ".")
header = T
, you use col_names = T
skip = N
n_max = N
range = "H40:T80"
range
is specified, skip
and n_max
are ignoredrange = "MY_SHEET!H40:T80"
Create a colored boxplot with the dataset iris, with nice axes titles
Code from the previous class
Use ggplot. The syntax may take a while to get used to, but it is easier to read.
# aes is short for aesthetics, basically what columns hold the values you want to plot
# fill is the color that will be inside the box, color is just for the border
ggplot(data = iris,
aes(x = Species,
y = Sepal.Length,
fill = Species)) +
geom_boxplot() + # the type of plot you want
ylab("Sepal length (mm)") +
xlab("Species epithet")
Improve ggplot by setting a better theme (check here see the options)
# Set a theme for all future plots in this session
theme_set(theme_bw())
# Use different colors to fill and remove the legend
colors = c("red", "green", "blue")
ggplot(data = iris,
aes(x = Species,
y = Sepal.Length,
fill = Species)) +
geom_boxplot(show.legend = F) + # the type of plot you want
ylab("Sepal length (mm)") +
xlab("Species epithet") +
scale_fill_manual(values = colors)
Graphics are layers of data and images put on top of each other. That is why the pieces of the ggplot function are connected by a +
.
The codes below construct the plot from the previous slide piece by piece.
labs
adds labels:
x
and y
- will add the labels to the axes (you can use the functions xlab
or ylab
instead)title
- Normally at the top of the figure (you can use the function ggtitle
instead)subtitle
- Goes under the title and is a bit smallercaption
- Normally goes at the bottom of the figure and is a small texttag
- the letter that goes at the top-left corner of a plot that is one part of several in a figureIn the example dataset, the names of the groups are in the middle not specified, they are inside the replicate name. We need one column with the values and one with the names of the treatments.
The table now is in the format “wide”, the one we want is called “long”
longData = pivot_longer(
data = rawData,
cols = !Measured, # gets all the columns of the table, except for the one in front of !
names_to = "Replicates", # name of the column that will contain column names from rawData
values_to = "Measurements"
)
str(longData)
tibble [48 × 3] (S3: tbl_df/tbl/data.frame)
$ Measured : chr [1:48] "A" "A" "A" "A" ...
$ Replicates : chr [1:48] "Control_1" "Control_2" "Control_3" "Treated_1" ...
$ Measurements: num [1:48] 0.123 0.827 0.149 0.086 0.314 ...
Let’s split the values from “Replicates” using str_split
from stringr
mutate
will return the input table with the new column we create
# pattern is what separates (_)
# i is the part that we want to see: Control_1 has 2 pieces, i=1 returns "Control"
longDataTreatments = longData %>%
mutate(Treatment = str_split_i(Replicates, pattern = "_", i = 1))
str(longDataTreatments)
tibble [48 × 4] (S3: tbl_df/tbl/data.frame)
$ Measured : chr [1:48] "A" "A" "A" "A" ...
$ Replicates : chr [1:48] "Control_1" "Control_2" "Control_3" "Treated_1" ...
$ Measurements: num [1:48] 0.123 0.827 0.149 0.086 0.314 ...
$ Treatment : chr [1:48] "Control" "Control" "Control" "Treated" ...
geom_boxplot
geom_violin
You can put different types of plot and multiple data in the same graphic.
Over the violin or boxplot layer, you can add the points representing the value measured for each replicate, the mean, etc.
If you have few replicates (< 5), use a barplot with the standard deviation (make sure the color of the bar allows the visualization of the error bar)
If many replicates per group, use a boxplot or violin plot.
The code to get a violin plot is the same as the one for a boxplot, the only difference is the geom_violin.
# Set a theme for all future plots in this session
theme_set(theme_bw())
# Use different colors to fill and remove the legend
colors = c("red", "green", "blue")
ggplot(data = iris,
aes(x = Species,
y = Sepal.Length,
fill = Species)) +
geom_violin(show.legend = F) + # violin instead of boxplot
ylab("Sepal length (mm)") +
xlab("Species epithet") +
scale_fill_manual(values = colors)
geom_col
or geom_bar(stat = "identity")
geom_errorbar(aes(ymin=Mean-sd, ymax=Mean+sd), width=.2)
+ coord_polar("y")
df <- data.frame(
group = c("Male", "Female", "Child"),
value = c(25, 25, 50)
)
# Barplot like geom_col
ggplot(df, aes(x = "", y = value, fill = group)) +
# x has to be "" - check how it looks if you put x = group
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 1) +
theme_void() # clean up grids, borders, and labels
You can use geom_bar(stat = "count")
if you do not have the number of occurrences of each category computed. This way, ggplot counts it and plots. - Eg. A table with metabolites detected separated by category: how many metabolites of each category were detected?
If you want proportions in %, use stat = "density"
instead of stat = "count"
stat = "identity"
# Datasets about US states
US_statesInfo = data.frame(Name = state.name,
Region = state.region,
Division = state.division)
x = "Number of states"
# Plot the number of states in each division
ggplot(US_statesInfo, aes(y = Division)) +
geom_bar(stat = "count") +
xlab(x)
# Plot the number of states in each region
ggplot(US_statesInfo, aes(y = Region)) +
geom_bar() +
xlab(x)
# Plot the number of states in each division, and color by the region
ggplot(US_statesInfo, aes(y = Division, fill = Region)) +
geom_bar(stat = "count") +
xlab(x)
Line graph - geom_line
Let’s say we want to see how the labs of 2 PIs have grown across the years and compare the two.
# Let's create a time series to plot
LabSize = data.frame(Isabel = c(0, 3, 5, 10, 30),
Hugo = c(2, 2, 6, 9, 9),
Year = seq(2014, 2022, 2))
LabSize.Long = pivot_longer(LabSize,
cols = !Year, # all columns from LabSize, except "Year"
names_to = "PI",
values_to = "LabMembers")
ggplot(LabSize.Long,
aes(x = Year, y = LabMembers, color = PI)) +
geom_line()
We will use the time series data
'data.frame': 12 obs. of 10 variables:
$ X : chr "Day_1" "Day_2" "Day_3" "Day_4" ...
$ Control_3 : num 0 1.96 3.98 6.18 7.52 ...
$ Control_2 : num 0 2.04 4.02 6 8 ...
$ Control_1 : num 0 2 4 5.82 8.48 ...
$ TreatmentA_1: num 0 2.94 6.03 8.73 12 ...
$ TreatmentA_2: num 0 3 6 9.27 11.28 ...
$ TreatmentA_3: num 0 3.06 5.97 9 12.72 ...
$ TreatmentB_3: num 0 1.53 3 4.5 6.36 ...
$ TreatmentB_2: num 0 1.47 2.98 4.37 5.64 ...
$ TreatmentB_1: num 0 1.5 3.02 4.63 6 ...
Let’s transform the data to format long and add a column with the name of the treatment group
library(tidyverse)
timeSeriesLong = pivot_longer(
timeSeries,
cols = !TimePoint,
names_to = c("Treatment", "Replicates"),
names_sep = "_",
values_to = "Growth_measure"
)
head(timeSeriesLong)
# A tibble: 6 × 4
TimePoint Treatment Replicates Growth_measure
<dbl> <chr> <chr> <dbl>
1 1 Control 3 0
2 1 Control 2 0
3 1 Control 1 0
4 1 TreatmentA 1 0
5 1 TreatmentA 2 0
6 1 TreatmentA 3 0
Now let’s create another table with the mean, standard deviation and coefficient of variation of each treatment per time point
For the error bars, we cannot simply say “here, this is the standard deviation”.
We need to tell it “use this value as maximum and this as minimum”
See that we set “ymin” and “ymax”. If you have standard deviation for the x-axis, you can add it with “xmin” and “xmax”.
# Create the linear model of Growth_measure ~ TimePoint, adding the Treatment interaction
linear.model = lm(Growth_measure ~ TimePoint*Treatment, data = timeSeriesLong)
summary(linear.model)
Call:
lm(formula = Growth_measure ~ TimePoint * Treatment, data = timeSeriesLong)
Residuals:
Min 1Q Median 3Q Max
-1.80 -0.06 0.00 0.06 1.80
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.00000 0.14189 -14.095 < 2e-16 ***
TimePoint 2.00000 0.01928 103.740 < 2e-16 ***
TreatmentTreatmentA -1.00000 0.20066 -4.984 2.56e-06 ***
TreatmentTreatmentB 0.50000 0.20066 2.492 0.0143 *
TimePoint:TreatmentTreatmentA 1.00000 0.02726 36.678 < 2e-16 ***
TimePoint:TreatmentTreatmentB -0.50000 0.02726 -18.339 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3993 on 102 degrees of freedom
Multiple R-squared: 0.9979, Adjusted R-squared: 0.9978
F-statistic: 9800 on 5 and 102 DF, p-value: < 2.2e-16
package 'lsmeans' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Karen\AppData\Local\Temp\RtmpEjEd3Q\downloaded_packages
linear.model.trends <- lstrends(linear.model, "Treatment", var="TimePoint")
summary(linear.model.trends)
Treatment TimePoint.trend SE df lower.CL upper.CL
Control 2.0 0.0193 102 1.96 2.04
TreatmentA 3.0 0.0193 102 2.96 3.04
TreatmentB 1.5 0.0193 102 1.46 1.54
Confidence level used: 0.95
contrast estimate SE df t.ratio p.value
Control - TreatmentA -1.0 0.0273 102 -36.678 <.0001
Control - TreatmentB 0.5 0.0273 102 18.339 <.0001
TreatmentA - TreatmentB 1.5 0.0273 102 55.016 <.0001
P value adjustment: tukey method for comparing a family of 3 estimates
Tuesdays group
Fridays group