2024-03-21
The input file is the table Arabidopsis_counts.csv.
counts = read.csv("https://karengoncalves.github.io/Programming_classes/exampleData/Arabidopsis_counts.csv",
header = T, row.names = 1)
This file has 12 columns (3 samples * 4 replicates) and 32833 rows (one per gene).
Samples are aerial tissue of Arabidopsis thaliana ecotype Col-0, harvested at 14 days post germination, with Basta (glyphosate) resistance gene and:
The design table explains to R what each column is:
Basically, you need first one column with the rows containing the same values as the column names of the count matrix.
Then, you need one (or more) columns that indicate the sample that replicate comes from.
Note that you separate things in the columns, eg. you harvested different tissues at different time points, you would need a design table with three columns: Replicates
, tissue_type
and Time_point
.
In this case, the genotypes of the plants is part of their name (just remove the _#
to get it):
design_matrix = design_matrix %>%
mutate(Genotype = gsub("_\\d+$", "",
# replace the underscore and digits at the end of the text
# with nothing
Replicates))
head(design_matrix)
Replicates Genotype
1 Control_1 Control
2 Control_2 Control
3 Control_3 Control
4 Control_4 Control
5 Mlp37347_1 Mlp37347
6 Mlp37347_2 Mlp37347
Not all genes in a genome will be expressed in a study (even one with many conditions and developmental stages), since some need very specific conditions to be expressed. Check for example the gene AT1G01046 (row 5)
Control_1 Control_2 Control_3 Control_4 Mlp37347_1 Mlp37347_2
AT1G01046 0 0 0 0 0 0
Mlp37347_3 Mlp37347_4 Mlp124499_1 Mlp124499_2 Mlp124499_3 Mlp124499_4
AT1G01046 0 0 0 0 0 0
We can clean up the data, removing genes that are not expressed ever in our experiment. In theory, genes not expressed will have read count == 0
in all samples.
Another way of filtering weakly expressed genes would be with the functions DAFS from the “KarenGoncalves/CustomSelection” package.
The function gives you the minimum expression level accepted per replicate (based on the sequencing depth), from there you can filter out the genes that are expressed at a level lower than that threshold.
You will need the count matrix and a named vector with the length of each gene.
# The first thing you do convert the read counts to TPM (transcripts per million):
tpm = Counts_to_tpm(counts=counts, featureLength=length)[[1]]
# Then you calculate the minimum expression level
cutv = DAFS(tpm=tpm) # value given as log2 TPM
# Calculate the mean minimum expression level and convert it to TPM
min_expr_level = mean(cutv) ^ 2
# Now to filter the weakly expressed genes,
# Remove the ones that are expressed below the min_expr_level
expressed_genes = apply(tpm, 1, mean) > min_expr_level
counts_filtered = countsData[, expressed_genes]
Now that you have the filtered counts matrix and the design table, you can run DESeq2 by first creating the dataset object then running the function DESeq.
Pay attention to always put your control samples at the beginning of the count matrix and the design table, as the first sample is taken as the control.
We used the design ~Genotype
, which is similar to taking the expression level for each gene and compare it between the different genotypes.
It takes the first level of the factor Genotype
as the control.
Because of this, the comparisons we are able to make are only Genotype_x
vs Genotype_Control
, we cannot compare Genotype_x
against Genotype_y
Use resultsNames
to see what comparisons are available
If we want to make all possible pairwise comparisons, we need to add a 0
to the design function.
If your samples are defined by 2 factors, say Genotype
and Age
(both present int the design_matrix
data.frame), you can set the design to either:
~Genotype + Age
- you will be able to compare genotype1
against genotype2
, and age1
against age2
, but not genotype1_age1
against genotype1_age2
~Genotype + Age + Genotype:Age
- models the differences:
Age
valueAge
.If you have a time series experiment, then you need to add a “reduced” design in the DESeq
function:
Now the main part of the analysis is done, we use the function results
to access the results and inform the type of stats you want performed.
[1] "Intercept" "Genotype_Mlp124499_vs_Control"
[3] "Genotype_Mlp37347_vs_Control"
# Remove the first element of name_result, since it is just the intercept
results_DESeq2 <-
lapply(name_result[-1], \(comparison) {
results(deseq_dataset,
name = comparison,
pAdjustMethod = "BH")@listData %>%
as_tibble() %>%
mutate(
# We need to add the gene names
Gene_name = rownames(counts_filtered),
# We classify the genes by deregulation
State = case_when(
log2FoldChange < 2 ~ "Down-regulated",
log2FoldChange > 2 ~ "Up-regulated",
.default = "equal"),
# Indicate from which comparison the data comes from
Contrast = comparison
)
}) %>%
list_rbind()
head(results_DESeq2)
# A tibble: 6 × 9
baseMean log2FoldChange lfcSE stat pvalue padj Gene_name State
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 73.8 -1.28 0.329 -3.89 0.0000989 0.00302 AT1G01010 Down-regul…
2 149. -0.296 0.209 -1.42 0.156 0.544 AT1G01020 Down-regul…
3 71.1 -0.709 0.248 -2.86 0.00430 0.0562 AT1G01030 Down-regul…
4 800. -0.122 0.216 -0.563 0.573 0.884 AT1G01040 Down-regul…
5 482. -0.308 0.330 -0.933 0.351 0.759 AT1G01050 Down-regul…
6 3353. -1.38 0.624 -2.21 NA NA AT1G01060 Down-regul…
# ℹ 1 more variable: Contrast <chr>
# A tibble: 2 × 2
# Groups: Contrast [2]
Contrast n
<chr> <int>
1 Genotype_Mlp124499_vs_Control 934
2 Genotype_Mlp37347_vs_Control 1583
# A tibble: 4 × 3
# Groups: Contrast, State [4]
Contrast State n
<chr> <chr> <int>
1 Genotype_Mlp124499_vs_Control Down-regulated 807
2 Genotype_Mlp124499_vs_Control Up-regulated 127
3 Genotype_Mlp37347_vs_Control Down-regulated 1467
4 Genotype_Mlp37347_vs_Control Up-regulated 116