However, most packages for bioinformatics are not downloadable with a simple install.packages, as they are not stored in CRAN, the basic website for R packages.
Instead, they are in Bioconductor, and some are in specific github repositories (think about the R_class_examples).
KarenGoncalves/ is my folder in the website, in there I have over 20 folders, one for each project.
R_class_examples is one of them.
if (!"devtools"%in%installed.packages()) {install.packages("devtools")}repos.github =c("YuLab-SMU/ggmsa")pkgs.github =gsub(".+/", "", repos.github) # remove the folder namepkgs.To.Install =which(! pkgs.github %in%installed.packages())if (length(pkgs.To.Install) !=0) {# To install the package, we need the folder + package name devtools::install_github(repos.github[pkgs.To.Install]) }
Copy the code below to load all the necessary packages.
In future, just come back here, copy the code below and change the list of packages.
So you just need to give the link to the function devtools::source_gist() and then you can use the function install_from_dif_sources
A note about classes
R objects (the things in the environment tab) belong to different classes. The basic ones you already know:
vectors have one dimension and data inside has to be all the same type: character vector; numeric vector; factor vector
matrices: vectors, but with columns and rows
data.frames: table made of vectors (columns) of the same size
lists: anything goes! If the objects stored in it have names, you can access them like list$object_name, as if the object inside was a column of a table
To check the class of an object you can use:
class(object)
A package may create it’s own object class. Sometimes you need to convert an object from one class to another, like:
(numeric_vector <-1:10)
[1] 1 2 3 4 5 6 7 8 9 10
class(numeric_vector)
[1] "integer"
(character_vector <-as.character(numeric_vector)) # change class
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
class(character_vector)
[1] "character"
So, just remember that, if a function starts with as., it is changing the class of the object.
Reading and writing fasta files
If you have protein sequences in a fasta file, copy it to your inputs folder, else use the following: Reductases.fasta
# To open a fasta file with multiple sequences, use the function readDNAStringSet() or readAAStringSet()fasta_input ="https://karengoncalves.github.io/Programming_classes/exampleData/Reductases.fasta"%>%readAAStringSet(format ="fasta")head(fasta_input)
# If you want to select just one or two sequences and export it, you can do:writeXStringSet(fasta_input[1:3], filepath ="input_subset.fasta", format ="fasta")
Multiple Sequence Alignment (MSA)
Make things easy for yourself later, add the names of the parameters of each function so you know what you need to put next time.
There are 4 functions for MSA:
msaClustalW - ClustalW is a general purpose multiple alignment program for DNA or proteins. 1
cluster parameter can be either Neighbor Joining ("nj") or Unweighted Pair Group Method with Arithmetic mean ("upgma")
Do not use "upgma" if you are on windows and your R version is 4.x (x being any number) - R crashes
msaClustalOmega - New MSA tool that uses seeded guide trees and HMM profile-profile techniques to generate alignments. Suitable for medium-large alignments.2
default substitutionMatrix: "Gonnet"; other options: "BLOSUM30", "BLOSUM40, "BLOSUM50", "BLOSUM65" and "BLOSUM80".
msaMuscle - Accurate MSA tool, especially good with proteins. Suitable for medium alignments.2
cluster parameter can be either NJ ("neighborjoining") or UPGMA ("upgma") - UPGMA gives better results 3
msa - you can set the method with the argument method or just put the sequences and use the default ("Clustalw") (other options: "ClustalOmega", "Muscle")
Add verbose = T so that the parameters used are printed to the screen and you can take note for your publications.
Running MSA
# The dataset used as example has 48 sequences, we will subset it for the alignment, selecting only 10fasta_for_alignment <- fasta_input[1:10]# use ?msa to know what you need to put in the function and what the default values aremyFirstAlignment <-msa(fasta_for_alignment, method ="ClustalOmega",verbose = T)
Just calling the MSA result will not show the whole alignment. For that you can use print()
print(myFirstAlignment, show ="complete", showConsensus = T)
Use ggmsa for visualization
To see the alignment as a plot, we can use ggmsa.
We need to change the type of object of the alignment result, as ggmsa only accepts "AAMultipleAlignment" objects
# The original result is of class MsaAAMultipleAlignmentmyFirstAlignment2 <- myFirstAlignment # keep the original, make a copy to modify itclass(myFirstAlignment2) <-"AAMultipleAlignment"# We will plot only a part of the alignment - from the AA 220 to the AA 250 - so it is visible ggmsa(myFirstAlignment2, start =220, end =250, seq_name = T)
# We can see also the consensus highlightedggmsa(myFirstAlignment2, start =220, end =250, seq_name = T, consensus_views = T)
# Remove the # below if you want to save the figure#ggsave("First_alignment_consensus_220_250.tiff")
We can set the width of the characters (use char_width = #), so the residues are not so big
ggmsa(myFirstAlignment2, start =220, end =250, seq_name = T, char_width =0.5)
A seqlogo is a plot with letters on top of each other representing their frequency at each position in the alignment. The bigger a letter, the more frequent it is at a given position.
We can add this plot with + geom_seqlogo()
you can use different color schemes with geom_seqlogo(color = #)1
you can set your color scheme with geom_seqlogo(custom_color = data.frame)1
Learn what is necessary to make a tree, how to read it and what conclusions you can and cannot draw from them.
Sequences used for the tree are the leaves/tips.
The point where the tree starts (the center if it is circular, or the first tip to separate from the rest of the tree) is the root
The lines drawing the tree are the branches/edges.
Going from the leaves to the root, the point where two branches connect is called a node
Compute the distance matrix
We will use ape and seqinr to create a tree from the alignment. But we need to, again, transform the type of object we got from the msa.
Then we calculate a distance matrix.
A distance matrix shows how different each sequence is from the others. It has as many rows as columns, since every sequence is compared to the others.
Its diagonal is all 0s, since the difference between a sequence and itself is 0.
The other rows and columns are values that indicate the amount of mismatches between two sequences. The greater the number, the less similar the two sequences are.
# We need to first export the alignment and import it again to transform it myFirstAlignment3 <- myFirstAlignment %>%as.phyDat()# Then, we create a distance matrixdistance_alignment <-dist.ml(x = myFirstAlignment3, model ="Blosum62")
Tree computation
Then, we use the distance matrix to compute the tree, joining sequences most similar, then joining groups of sequences. Each method of computing will do this joining differently.
The two FastME (fast minimum evolution) functions (balanced - bal - and ordinary least-squares - ols) perform the minimum evolution algorithm of Desper and Gascuel (2002).
upgma_tree <-upgma(distance_alignment)# We save the tree in a object so we can access the leaf labes with $tip.label
We will use a function called parse to italicize only the species name. It evaluates the text and executes functions inside it! Example, in:
"'TYDC '~italic('Narcissus pseudonarcissus')"
The texts 'TYDC ' and 'Narcissus pseudonarcissus' (in italic) are joined by the symbol ~.
In this case, the protein name has just letters and numbers and is at the start of the sequence name, followed by the species name. So we can use gsub to create the label we want.
Check the result before plotting. See that your result looks like the example with TYDC above, as you need to use both double and single quotes.
ggtree(upgma_tree) +# use geom_tiplab to plot the seqnames# add parse = T so R executes the function inside the textgeom_tiplab(parse = T) +xlim(c(0, 2))
# the upper x limit may change depending on the length of your labels, play around until you get the whole label but little empty space
More customization
Let’s say you want to add bars marking different groups in a tree, like the vertical bars marking the clades in the figure below (Figure 3 in Hori et al. 2006):
In our example tree, there are 2 main groups: nodes 12 and 13. We can use the functiongeom_cladelabel to mark each one
There is also geom_cladelab, but you cannot change the color of the text with it
ggtree(upgma_tree) +# we do not need the line below anymore#geom_text(aes(label = node), hjust = -.2) +geom_tiplab(parse = T) +geom_cladelabel(node =12, label ="Node 12",color ="red", offset =1) +geom_cladelabel(node =13, label ="Node 13",color ="green4", offset =1,barsize =2) +xlim(c(0, 2.5))
You can draw a rectangle around the node and its branches using geom_highlight
ggtree(upgma_tree) +# we do not need the line below anymore#geom_text(aes(label = node), hjust = -.2) +geom_tiplab(parse = T) +geom_highlight(node =12,fill ="red") +# change the transparency of the rectangle with alphageom_cladelabel(node =13, label ="Node 13",color ="green4", offset =1,barsize =2) +xlim(c(0, 2.5))
Statistical support
The statistical support on phylogenetic trees is calculated with bootstrap. In simple terms, bootstrap is repetition. In more complicated terms, the tree computation (including the calculation of the distance matrix) is repeated n times1, and the result shows the proportion of times the node appeared.
If out of 1000 times, a given group appears 100 times, its support is 10%.
1 n is determined by the number of sequences you are working with.
n is the number of times the calculation was repeated, so it includes the first time. That is why you may see it reported as ‘bootstrap was performed with 999 replicates’ (plus the original tree calculation, you get 1 000 replicates)
Although more is better, using n=10 000 for a tree of 10 sequences is overkill and may take too much time to calculate.
The most typical n used (for < 50 sequences in a tree) is 999
Bootstrap support
Let’s consider that we want to use upgma to calculate our tree and we want to calculate the support of the branches.
ggtree(bs_plot2) +geom_tiplab(parse = T) +geom_highlight(node =12,fill ="red") +geom_nodelab(nudge_x =0.03) +# we use nudge_x to move the label away from the branchesxlim(c(0, 1.5))
We export the trees to a newick (.nwk) file
This format is readable by any program working with trees.