2024-03-28
counts = read.csv("https://karengoncalves.github.io/Programming_classes/exampleData/Arabidopsis_counts.csv",
header = T, row.names = 1)
design_matrix =
data.frame(Replicates = names(counts)) %>%
mutate(Genotype =
gsub("_\\d+$", "",
Replicates))
genes_not_expressed = rowSums(counts) == 0
counts_filtered = counts[ !genes_not_expressed, ]
results_DESeq2 <-
lapply(resultsNames(deseq_dataset)[-1], \(comparison) {
results(deseq_dataset,
name = comparison,
pAdjustMethod = "BH")@listData %>%
as_tibble() %>%
mutate(
Gene_name = rownames(counts_filtered),
State = case_when(
log2FoldChange < -2 & padj < 0.01 ~ "Down-regulated",
log2FoldChange > 2 & padj < 0.01 ~ "Up-regulated",
.default = "equal"),
Contrast = comparison
)
}) %>%
list_rbind()
There are two basic ways we can check for sample homogeneity, meaning how similar the replicates of a sample are to each other: principal component analysis (PCA) or clustering analysis.
To perform either of them, we need to use the normalized data, not the read counts.
You can either use rlogTransformation
or varianceStabilizingTransformation
for PCA, as they return an object that includes the metadata. The first takes about 5x longer than the second, but results are similar. Check the help page for details on the computation.
To see the transformed counts, you need to run assay
on the result.
rlogTransformation
Control_1 Control_2 Control_3 Control_4 Mlp37347_1 Mlp37347_2
AT1G01010 6.856238 6.614363 6.614790 6.165110 5.471233 5.796102
AT1G01020 7.337836 7.045180 7.355484 7.501417 7.186908 7.253583
AT1G01030 6.490673 6.253299 6.147981 6.267220 6.017791 6.081412
AT1G01040 9.989951 9.705570 9.593858 9.700406 9.418827 9.425975
AT1G01050 8.486265 8.897727 9.000133 9.285109 8.968637 8.808020
AT1G01060 12.205905 12.287763 12.279939 11.944880 9.507479 9.424758
Mlp37347_3 Mlp37347_4 Mlp124499_1 Mlp124499_2 Mlp124499_3 Mlp124499_4
AT1G01010 5.535472 5.900682 5.783818 5.766419 6.128911 5.693272
AT1G01020 7.104379 7.029025 7.304103 7.097543 7.095395 7.089999
AT1G01030 6.232429 6.298843 5.790084 5.812633 5.870933 6.115346
AT1G01040 9.402218 9.495822 9.673989 9.839446 9.826400 9.283701
AT1G01050 8.940594 9.085135 8.816797 8.795406 8.148923 9.082900
AT1G01060 9.575817 10.642220 10.511861 10.605765 10.339630 12.211014
varianceStabilizingTransformation
(VST) Control_1 Control_2 Control_3 Control_4 Mlp37347_1 Mlp37347_2
AT1G01010 7.590094 7.234199 7.235298 6.594604 5.592357 6.065537
AT1G01020 7.590094 7.149626 7.613059 7.822869 7.363027 7.462015
AT1G01030 7.025739 6.670862 6.517423 6.687298 6.327689 6.420063
AT1G01040 10.215535 9.790676 9.620667 9.782615 9.349971 9.360973
AT1G01050 8.335243 8.978957 9.133170 9.552363 9.085613 8.842539
AT1G01060 12.768174 12.878989 12.868370 12.406869 8.429246 8.248008
Mlp37347_3 Mlp37347_4 Mlp124499_1 Mlp124499_2 Mlp124499_3 Mlp124499_4
AT1G01010 5.660322 6.212961 6.038054 6.014203 6.547341 5.910029
AT1G01020 7.238271 7.122962 7.538112 7.228004 7.224890 7.217355
AT1G01030 6.642055 6.739925 5.976459 6.014203 6.103966 6.469575
AT1G01040 9.323073 9.469278 9.742963 9.991844 9.972339 9.135749
AT1G01050 9.043900 9.260498 8.855706 8.822978 7.784839 9.256563
AT1G01060 8.562044 10.476041 10.264868 10.417590 9.978984 12.774389
DESeq2 offers a function for PCA that takes the result from either of the two functions and returns a ggplot object.
Because of this, we can add things to the plot as if it was a ggplot.
DESeq2 offers a function for PCA that takes the result from either of the two functions and returns a ggplot object.
Because of this, we can add things to the plot as if it was a ggplot.
Hierarchical clustering is a graphical way to show the pairwise similarity between samples.
We just need to edit the rlog_transformed or the VST_data so that the samples are shown in the rows and the genes in the columns.
You can show the dendrogram and the distance matrix in a heatmap
If you want to plot it with ggplot to have more control, you need to do some transformations
# Pivot longer the distance matrix and add the row names
long_distance <-
distance_data %>% as.matrix() %>% as_tibble() %>%
# rownames of the distance matrix are in the attr(ibute) Labels
mutate(Rep1 = attr(distance_data, which = "Labels")) %>%
pivot_longer(cols = !Rep1,
names_to = "Rep2",
values_to = "distance")
If you want to plot it with ggplot to have more control, you need to do some transformations
# Pivot longer the distance matrix and add the row names
long_distance <-
distance_data %>% as.matrix() %>% as_tibble() %>%
# rownames of the distance matrix are in the attr(ibute) Labels
mutate(Rep1 = attr(distance_data, which = "Labels")) %>%
pivot_longer(cols = !Rep1,
names_to = "Rep2",
values_to = "distance")
# Add a column with the relative distance of the samples
(max_distance = max(long_distance$distance))
If you want to plot it with ggplot to have more control, you need to do some transformations
# Pivot longer the distance matrix and add the row names
long_distance <-
distance_data %>% as.matrix() %>% as_tibble() %>%
# rownames of the distance matrix are in the attr(ibute) Labels
mutate(Rep1 = attr(distance_data, which = "Labels")) %>%
pivot_longer(cols = !Rep1,
names_to = "Rep2",
values_to = "distance")
# Add a column with the relative distance of the samples
(max_distance = max(long_distance$distance))
[1] 106.0121
long_distance = long_distance %>%
mutate(Rel_distance = (distance / max_distance) * 100)
# Sort the sample names according to the dendrogram
long_distance$Rep1 =
factor(long_distance$Rep1,
levels = cluster_hc$labels[cluster_hc$order]
)
long_distance$Rep2 =
factor(long_distance$Rep2,
levels = cluster_hc$labels[cluster_hc$order]
)
Normally, we want to plot the deregulated genes, find patterns among them, etc. The first thing we can plot is a volcano plot for each contrast checked.
It is simply the fold change in the x-axis and the -log10p-adj in the y-axis.
Note that there are cases where the p-value, and thus the adjusted p-value, are NA, so we will create a new variable that replaces NA with 1 (those genes are not deregulated)
basic_volcanos +
xlim(-10, 10) +
scale_color_manual(name = "", values =
c("Up-regulated" = "red",
"Down-regulated" = "blue",
"equal" = "grey")) +
scale_alpha_manual(name = "", values =
c("Up-regulated" = 1,
"Down-regulated" = 1,
"equal" = 0.3)) +
labs(x = expression("log"[2]~"Fold Change"),
y = expression("-log"[10]~"adjusted p-Value")) +
theme(legend.position = "bottom")
In our case, we just have one factor (Genotype) in the design formula, and both contrasts are against the control sample.
So we can modify the table of results to remove the redundant information
Now we also want to sort the genes so that the ones with similar deregulation are grouped together. We can use a cluster analysis for that.
But, we need to use a wide table, not a long one, to compute the distance matrix. We will also filter out the genes that are not deregulated, so that the analysis don’t consider non-significant differences.
values_fill = 0
: after filtering, there will be genes with fold change levels present for only one contrast because they are not deregulated in the other condition. Since in the wide format, we need values for both contrasts, we replace the empty cell with 0.Now we can use wide_results
in the dist
function?
Yes, after we remove the Gene_name
column (dist only accepts numbers)
Now we just make a column of ordered genes
Whether you plot the genes in the x or in the y axis depends on how readable the plot is in the end (how many genes are deregulated, how many contrasts you have, etc)
Whether you plot the genes in the x or in the y axis depends on how readable the plot is in the end (how many genes are deregulated, how many contrasts you have, etc)
Whether you plot the genes in the x or in the y axis depends on how readable the plot is in the end (how many genes are deregulated, how many contrasts you have, etc)
results_DESeq2_filtered %>%
ggplot(aes(x = orderedGenes, y = shortContrast,
fill = ifelse(abs(log2FoldChange) > 2 & padj < 0.01,
log2FoldChange, 0)
)) +
geom_tile() +
scale_fill_gradient2(name = "Fold change",
mid = "white",
low = "midnightblue",
high = "red") +
theme(axis.title = element_blank(),
axis.text.x = element_blank(),
legend.position = "bottom",
axis.ticks = element_blank())