Multiple sequence alignment and phylogenetic trees

Karen Cristine Goncalves, Ph.D.

2024-01-01

Packages

Before, we installed packages with the code below

pkgs = c("rmarkdown", "tidyverse", "venn")
pkgs.To.Install = which(! pkgs %in% installed.packages())
if (length(pkgs.To.Install) != 0) install.packages(pkgs[pkgs.To.Install])
for (curPkg in pkgs) library(curPkg, character.only = T) 

However, most packages for bioinformatics are not downloadable with a simple install.packages, as they are not stored in CRAN, the basic website for R packages.

Instead, they are in Bioconductor, and some are in specific github repositories (think about the R_class_examples).

Installing packages from other sources

For these, we use the package BiocManager.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
pkgs.bioconductor = c("msa", "Biostrings")
pkgs.To.Install = which(! pkgs.bioconductor %in% installed.packages())
if (length(pkgs.To.Install) != 0) {
    BiocManager::install(pkgs.bioconductor[pkgs.To.Install])    
}

In BiocManager::install, the :: indicates that the function install comes from the package BiocManager.

So even if there is another package with the same function, R will know which one to use.

Github is a website that has folders for each user, and each user may have multiple folders: one for each project. Example, in:

  • https://github.com/KarenGoncalves/R_class_examples
    • KarenGoncalves/ is my folder in the website, in there I have over 20 folders, one for each project.
    • R_class_examples is one of them.
if (! "devtools" %in% installed.packages()) {
    install.packages("devtools")
}

repos.github = c("YuLab-SMU/ggmsa")
pkgs.github = gsub(".+/", "", repos.github) # remove the folder name
pkgs.To.Install = which(! pkgs.github %in% installed.packages())

if (length(pkgs.To.Install) != 0) {
    # To install the package, we need the folder + package name
    devtools::install_github(repos.github[pkgs.To.Install]) 
}

Copy the code below to load all the necessary packages.

In future, just come back here, copy the code below and change the list of packages.

devtools::source_gist("https://gist.github.com/KarenGoncalves/0db105bceff4ff69547ee25460dda978")

install_from_dif_sources(
    cran_packages = c("tidyverse", "tinytex", "patchwork"),
    bioconductor_packages = c("Biostrings", "msa", "treeio", "ggtree", "ape", "seqinr", "phangorn"),
    github_packages = "YuLab-SMU/ggmsa"
)

The link https://gist.github.com/KarenGoncalves/0db105bceff4ff69547ee25460dda978 contains the R script that creates functions to load the packages from different sources: CRAN (the typical source), Bioconductor and github.

So you just need to give the link to the function devtools::source_gist() and then you can use the function install_from_dif_sources

A note about classes

R objects (the things in the environment tab) belong to different classes. The basic ones you already know:

  • vectors have one dimension and data inside has to be all the same type: character vector; numeric vector; factor vector
  • matrices: vectors, but with columns and rows
  • data.frames: table made of vectors (columns) of the same size
  • lists: anything goes! If the objects stored in it have names, you can access them like list$object_name, as if the object inside was a column of a table

To check the class of an object you can use:

class(object)

A package may create it’s own object class. Sometimes you need to convert an object from one class to another, like:

(numeric_vector <- 1:10)
 [1]  1  2  3  4  5  6  7  8  9 10
class(numeric_vector)
[1] "integer"
(character_vector <- as.character(numeric_vector)) # change class
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
class(character_vector)
[1] "character"

So, just remember that, if a function starts with as., it is changing the class of the object.

Reading and writing fasta files

If you have protein sequences in a fasta file, copy it to your inputs folder, else use the following: Reductases.fasta

# To open a fasta file with multiple sequences, use the function readDNAStringSet() or readAAStringSet()
fasta_input = "https://karengoncalves.github.io/Programming_classes/exampleData/Reductases.fasta" %>%
    readAAStringSet(format = "fasta")
head(fasta_input)
AAStringSet object of length 6:
    width seq                                               names               
[1]   318 MATENKILILGPTGAIGRHIVWA...IEASEAYPDVTYTTADEYLNQFV IFR Medicago sativa
[2]   330 MAAGFLFHMGSLPAIATVGHKSK...GEATKLYPEVGYTTVVEYMKRYV PILR2 Linum usita...
[3]   598 MVGSIVGSNMAATDARFLSSNFG...VAHLPDKANNYLTPALSVLEKNT HC173 Arabidopsis...
[4]   323 MTTGKGKILILGATGYLGKYMVK...EYLKICLVNPPKPKLATYAQPST IGS1 Petunia hybrida
[5]   314 MEENGMKSKILIFGGTGYIGNHM...FTTIDELLDIFVHDPPPPASAAF EGS1 Ocimum basil...
[6]   310 MVASEKSKILIIGGTGYIGKYLV...VEATELYPKVKYTTVDEFYNKFV IFRH Nicotiana gl...
# If you want to select just one or two sequences and export it, you can do:
writeXStringSet(fasta_input[1:3], 
        filepath = "input_subset.fasta", 
        format = "fasta")

Multiple Sequence Alignment (MSA)

Make things easy for yourself later, add the names of the parameters of each function so you know what you need to put next time.

  • There are 4 functions for MSA:
    • msaClustalW - ClustalW is a general purpose multiple alignment program for DNA or proteins. 1
      • cluster parameter can be either Neighbor Joining ("nj") or Unweighted Pair Group Method with Arithmetic mean ("upgma")
        • Do not use "upgma" if you are on windows and your R version is 4.x (x being any number) - R crashes
    • msaClustalOmega - New MSA tool that uses seeded guide trees and HMM profile-profile techniques to generate alignments. Suitable for medium-large alignments.2
      • default substitutionMatrix: "Gonnet"; other options: "BLOSUM30", "BLOSUM40, "BLOSUM50", "BLOSUM65" and "BLOSUM80".
    • msaMuscle - Accurate MSA tool, especially good with proteins. Suitable for medium alignments.2
      • cluster parameter can be either NJ ("neighborjoining") or UPGMA ("upgma") - UPGMA gives better results 3
    • msa - you can set the method with the argument method or just put the sequences and use the default ("Clustalw") (other options: "ClustalOmega", "Muscle")
  • Add verbose = T so that the parameters used are printed to the screen and you can take note for your publications.

Running MSA

# The dataset used as example has 48 sequences, we will subset it for the alignment, selecting only 10
fasta_for_alignment <- fasta_input[1:10]
# use ?msa to know what you need to put in the function and what the default values are
myFirstAlignment <- msa(fasta_for_alignment, 
            method = "ClustalOmega",
            verbose = T
)
params: clustalo --R -o tempClustalOmega.aln --outfmt=clustal --seqtype=protein --force --gapopen=6.000000 --gapext=1.000000 --cluster-size=100 --iter=0 --output-order=tree-order
using Gonnet
myFirstAlignment
ClustalOmega 1.2.0 

Call:
   msa(fasta_for_alignment, method = "ClustalOmega", verbose = T)

MsaAAMultipleAlignment with 10 rows and 780 columns
     aln                                                   names
 [1] -------------------------...------------------------- 5BPOR Digitalis l...
 [2] -------------------------...----------ATYAQPST------- IGS1 Petunia hybrida
 [3] -------------------------...----------AAF------------ EGS1 Ocimum basil...
 [4] -------------------------...------------------------- PILR2 Linum usita...
 [5] -------------------------...------------------------- IFR Medicago sativa
 [6] -------------------------...------------------------- IFRH Nicotiana gl...
 [7] MVGSIVGSNMAATDARFLSSNFGNS...ELVAHLPDKANNYLTPALSVLEKNT HC173 Arabidopsis...
 [8] -------------------------...------------------------- ADH1 Catharanthus...
 [9] -------------------------...------------------------- ADH1 Euphorbia la...
[10] -------------------------...------------------------- ADRC1 Arabidopsis...
 Con -------------------------...------------------------- Consensus 

Visualizing the result

Just calling the MSA result will not show the whole alignment. For that you can use print()

print(myFirstAlignment, show = "complete", showConsensus = T)

Use ggmsa for visualization

To see the alignment as a plot, we can use ggmsa.

We need to change the type of object of the alignment result, as ggmsa only accepts "AAMultipleAlignment" objects

# The original result is of class MsaAAMultipleAlignment
myFirstAlignment2 <- myFirstAlignment # keep the original, make a copy to modify it
class(myFirstAlignment2) <-  "AAMultipleAlignment"

# We will plot only a part of the alignment - from the AA 220 to the AA 250 - so it is visible 
ggmsa(myFirstAlignment2, 
      start = 220, end = 250, 
      seq_name = T)

# We can see also the consensus highlighted
ggmsa(myFirstAlignment2, 
      start = 220, end = 250, 
      seq_name = T, consensus_views = T)

# Remove the # below if you want to save the figure
#ggsave("First_alignment_consensus_220_250.tiff")

More ggmsa options

We can set the width of the characters (use char_width = #), so the residues are not so big

ggmsa(myFirstAlignment2, 
      start = 220, end = 250, 
      seq_name = T, char_width = 0.5)

A seqlogo is a plot with letters on top of each other representing their frequency at each position in the alignment. The bigger a letter, the more frequent it is at a given position.

  • We can add this plot with + geom_seqlogo()
    • you can use different color schemes with geom_seqlogo(color = #)1
    • you can set your color scheme with geom_seqlogo(custom_color = data.frame)1
ggmsa(myFirstAlignment2, 
      start = 220, end = 250, 
      seq_name = T, char_width = 0.5) + 
    geom_seqlogo() 

# Note that the gaps are also represented in the seqlogo!

1 Check the options with ?geom_seqlogo

ggmsa(myFirstAlignment2, 
      start = 220, end = 250, 
      seq_name = T, char_width = 0.5) + 
    geom_seqlogo() +
    geom_msaBar()

Make a phylogenetic tree

Learn what is necessary to make a tree, how to read it and what conclusions you can and cannot draw from them.

  • Sequences used for the tree are the leaves/tips.
  • The point where the tree starts (the center if it is circular, or the first tip to separate from the rest of the tree) is the root
  • The lines drawing the tree are the branches/edges.
  • Going from the leaves to the root, the point where two branches connect is called a node

Compute the distance matrix

We will use ape and seqinr to create a tree from the alignment. But we need to, again, transform the type of object we got from the msa.

Then we calculate a distance matrix.

A distance matrix shows how different each sequence is from the others. It has as many rows as columns, since every sequence is compared to the others.

  • Its diagonal is all 0s, since the difference between a sequence and itself is 0.
  • The other rows and columns are values that indicate the amount of mismatches between two sequences. The greater the number, the less similar the two sequences are.
# We need to first export the alignment and import it again to transform it 
myFirstAlignment3 <- 
    myFirstAlignment %>% as.phyDat()

# Then, we create a distance matrix
distance_alignment <- 
    dist.ml(x = myFirstAlignment3, 
        model = "Blosum62")

Tree computation

Then, we use the distance matrix to compute the tree, joining sequences most similar, then joining groups of sequences. Each method of computing will do this joining differently.

This function performs the neighbor-joining tree estimation of Saitou and Nei (1987).

nj(X = distance_alignment) %>% plot(main = "ape::nj()")

nj(X = distance_alignment) %>% 
    plot(main = "phangorn::NJ()")

This function performs the BIONJ algorithm of Gascuel (1997).

bionj(X = distance_alignment) %>% plot(main = "ape::bionj")

UNJ(distance_alignment) %>% plot(main = "phangorn::UNJ")

The two FastME (fast minimum evolution) functions (balanced - bal - and ordinary least-squares - ols) perform the minimum evolution algorithm of Desper and Gascuel (2002).

fastme.bal(distance_alignment) %>% plot(main = "FastME BAL")

fastme.ols(distance_alignment) %>% plot(main = "FastME OLS")

  • UPGMA: uses the function hclust (hierarchical clustering) with method = "average", then transforms the dendrogram into the class phylo.
  • WPGMA: uses the function hclust with method = "mcquitty", then transforms the dendrogram into the class phylo.

Run ?hclust then look the section “Details”: the methods are explained with more details with references.

upgma(distance_alignment) %>% plot(main = "UPGMA")

wpgma(distance_alignment) %>% plot(main = "WPGMA")

Customizing the plot

We can use dplyr to separate protein and species names.

upgma_tree <- upgma(distance_alignment)
# We save the tree in a object so we can access the leaf labes with $tip.label

We will use a function called parse to italicize only the species name. It evaluates the text and executes functions inside it! Example, in:

"'TYDC '~italic('Narcissus pseudonarcissus')"

The texts 'TYDC ' and 'Narcissus pseudonarcissus' (in italic) are joined by the symbol ~.

In this case, the protein name has just letters and numbers and is at the start of the sequence name, followed by the species name. So we can use gsub to create the label we want.

Seqnames_tree <- 
    gsub("(\\w+) (.+)", 
         "'\\1 '~italic('\\2')",
         upgma_tree$tip.label)
upgma_tree$tip.label <- Seqnames_tree

Check the result before plotting. See that your result looks like the example with TYDC above, as you need to use both double and single quotes.

ggtree(upgma_tree) +
    # use geom_tiplab to plot the seqnames
    # add parse = T so R executes the function inside the text
    geom_tiplab(parse = T) +
    xlim(c(0, 2)) 

    # the upper x limit may change depending on the length of your labels, play around until you get the whole label but little empty space

More customization

Let’s say you want to add bars marking different groups in a tree, like the vertical bars marking the clades in the figure below (Figure 3 in Hori et al. 2006):

Highlight important nodes

Just as we added the names of the leaves, we can add the names of the nodes with geom_text(aes(label=node))

We may need to play around with the hjust, which goes after aes(), inside geom_text()

ggtree(upgma_tree) +
    geom_text(aes(label = node), hjust = -.2) +
    geom_tiplab(parse = T, hjust = -.15) +
    xlim(c(0, 2))

In our example tree, there are 2 main groups: nodes 12 and 13. We can use the functiongeom_cladelabel to mark each one

There is also geom_cladelab, but you cannot change the color of the text with it

ggtree(upgma_tree) +
    # we do not need the line below anymore
    #geom_text(aes(label = node), hjust = -.2) +
    geom_tiplab(parse = T) +
    geom_cladelabel(node = 12, label = "Node 12",
            color = "red", offset = 1) +
    geom_cladelabel(node = 13, label = "Node 13",
            color = "green4", offset = 1,
            barsize = 2) +
    xlim(c(0, 2.5))

You can draw a rectangle around the node and its branches using geom_highlight

ggtree(upgma_tree) +
    # we do not need the line below anymore
    #geom_text(aes(label = node), hjust = -.2) +
    geom_tiplab(parse = T) +
    geom_highlight(node = 12,
               fill = "red") +
    # change the transparency of the rectangle with alpha
    geom_cladelabel(node = 13, label = "Node 13",
            color = "green4", offset = 1,
            barsize = 2) +
    xlim(c(0, 2.5))

Statistical support

The statistical support on phylogenetic trees is calculated with bootstrap. In simple terms, bootstrap is repetition. In more complicated terms, the tree computation (including the calculation of the distance matrix) is repeated n times1, and the result shows the proportion of times the node appeared.

If out of 1000 times, a given group appears 100 times, its support is 10%.

1 n is determined by the number of sequences you are working with.

  • n is the number of times the calculation was repeated, so it includes the first time. That is why you may see it reported as ‘bootstrap was performed with 999 replicates’ (plus the original tree calculation, you get 1 000 replicates)
  • Although more is better, using n=10 000 for a tree of 10 sequences is overkill and may take too much time to calculate.
  • The most typical n used (for < 50 sequences in a tree) is 999

Bootstrap support

Let’s consider that we want to use upgma to calculate our tree and we want to calculate the support of the branches.

We use the function bootstrap.phyDat the same way we use sapply: using this dataset, perform this set of functions.

upgma_tree <- myFirstAlignment3 |>
    dist.ml() |> upgma()
bs_upgma <- bootstrap.phyDat(
    myFirstAlignment3, 
    bs = 100, \(x){
        dist.ml (x) |> upgma()
    })

bs_plot <- plotBS(upgma_tree, bs_upgma)

# The data stored in bs_plot is exactly what we need for ggtree

Since this is a new file, we need to correct the sequence names again.

We also set a threshold for bootstrap. Only values passing the threshold are shown.

bs_plot2 <- bs_plot

# We correct the sequence names
Seqnames_tree <- 
    gsub("(\\w+) (.+)", 
         "'\\1 '~italic('\\2')",
         bs_plot$tip.label)
bs_plot2$tip.label <- Seqnames_tree

# Bootstrap threshold
bs_threshold <- 50
bs_plot2$node.label <- 
    ifelse(bs_plot$node.label < bs_threshold,
           "", bs_plot$node.label)
ggtree(bs_plot2) +
    geom_tiplab(parse = T) +
    geom_highlight(node = 12,
               fill = "red") +
    geom_nodelab(nudge_x = 0.03) +
    # we use nudge_x to move the label away from the branches
    xlim(c(0, 1.5))

We export the trees to a newick (.nwk) file

This format is readable by any program working with trees.

write.tree(bs_upgma, file = "bootstrap_upgma_trees.nwk")

Resources

Aminoacid substitution models for dist.ml()

Specific