2023-08-10
To work with your files, or to create new data and use it, you need to store things in R’s memory.
The things you store are called variables.
How to create a variable in R: use = or <- to separate the name from the value you want it to store.

Bad names | Why? | Good names |
---|---|---|
x, y, z, counts | x, y, z don't tell us what the data there has, so you will not know later what you used | rawCountsRNAseq, fluorescenceTableDay1 |
bacteriacountsforday5 | Could be a good name, but it is difficult to read; use upper and lower case or _ to improve readability | bacteriaCountsForDay5 |
microplate | Does not tell what kind of data is stored | microplateFluorescence, microplateConcentration, microplateOD |
Avoid special characters in variable names: *(),\/"';:<>{}[]~!@#$%^+=
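As a minimal sketch of the two assignment operators, the lines below store values under descriptive names. The variable names and numbers are made up for illustration and do not come from the class data.

```r
# Both assignment operators store a value under a name
bacteriaCountsForDay5 <- c(120, 98, 143)  # made-up counts, for illustration only
fluorescenceDay1 = 0.52                   # "=" does the same job as "<-" here

# Typing the name of a variable prints the value it stores
bacteriaCountsForDay5
fluorescenceDay1
```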
You use R to do something with your data:
Calculate statistics, e.g. mean, standard deviation
Find out whether two or more groups are different, with a t-test or ANOVA
You do these things using functions.
A function is simply a piece of code stored in a variable with a specific name.
You use a function by typing its name followed by ().
Most of the time, you need to put something inside the (), e.g. a table, a column in a table, several numbers, etc.
help("mean") or ?mean will open a page in your browser (if you are using R) or the tab “Help” (if you are using RStudio) with an explanation of what the function does and how to use it.
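A short sketch of calling functions: the name, then the input inside the parentheses. The measurements are invented for this example, and the help call is left as a comment so the script also runs outside an interactive session.

```r
# Hypothetical measurements, invented for this example
plantHeights <- c(10.2, 11.5, 9.8, 12.1)

meanHeight <- mean(plantHeights)  # arithmetic mean of the values
sdHeight <- sd(plantHeights)      # standard deviation of the values

meanHeight
sdHeight

# Open the documentation of a function (uncomment to try it)
# ?sd
```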
You load files into R using functions: read.delim(), read.csv(), read_csv() (this last one is different from read.csv(); it comes from the readr package).
Example for a tab-separated file (columns separated by "\t"):
myFile <- read.delim(file = "Inputs/myFile.txt", header = T, row.names = 1)
"Inputs/myFile.txt" = where the file is stored; it has to be inside single or double quotes
header = T or header = TRUE are the same thing, but always in all CAPS
row.names = 1 tells R which column has the information that identifies each row as a different thing; you can use any column here, or you can omit this argument
Arguments are separated by commas (try writing a function without commas and see what happens)
Column separation
read.delim() was made to read tables with columns separated by tabs. If your columns are separated by something else, for example ;, you would write:
myFile <- read.delim(file = "Inputs/myFile.txt", header = T, row.names = 1, sep = ";")
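To try the sep argument without needing a file on disk, the sketch below first writes a tiny semicolon-separated table to a temporary file and then reads it back. The file contents and column names are made up for illustration.

```r
# Write a tiny semicolon-separated table to a temporary file (made-up values)
tmpFile <- tempfile(fileext = ".txt")
writeLines(c("Sample;Control;Treated",
             "A;0.12;0.31",
             "B;0.09;0.24"), tmpFile)

# sep = ";" tells read.delim() that columns are split by semicolons, not tabs
# row.names = 1 uses the first column (Sample) as the row names
myFile <- read.delim(file = tmpFile, header = TRUE, row.names = 1, sep = ";")

myFile
rownames(myFile)
```

If you leave out sep = ";", read.delim() looks for tabs, finds none, and puts each whole line into a single column.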
Load your dataset (Class1_exampleData.txt, from the preparatory slides "To do before the first class") and save it to the variable "myFirstInput".
The file is at "https://karengoncalves.github.io/Programming_classes/exampleData/Class1_exampleData.txt" or "Inputs/Class1_exampleData.txt".
If you want to see that the file is okay, you can check whether the beginning and end of the table look right:
head(myFirstInput) - will print the first 6 rows of a table.
tail(myFirstInput) - will print the last 6 rows of a table.
To change the number of rows, add n = x, where x is the number of lines you want to see.
  X Control_1 Control_2 Control_3 Treated_1 Treated_2 Treated_3
1 A 0.12292920 0.8274256 0.1485878 0.08604300 0.31445400 0.2195038
2 B 0.09068010 0.1097931 0.3352014 0.14033150 0.24412490 0.4332501
3 C 0.10680465 0.4686093 0.2418946 0.11318725 0.27928945 0.3263769
4 D 0.01612455 0.3588163 0.0933068 0.02714425 0.03516455 0.1068732
5 E 0.15325396 0.4616864 0.9418977 0.32840471 0.09089286 0.3163602
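The n argument can be tried on any table. This sketch uses R's built-in PlantGrowth dataset as a stand-in for myFirstInput, so it runs without downloading the class file.

```r
# PlantGrowth ships with R; here it stands in for your own table
firstRows <- head(PlantGrowth, n = 3)  # first 3 rows instead of the default 6
lastRows <- tail(PlantGrowth, n = 2)   # last 2 rows instead of the default 6

firstRows
lastRows
```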
str(myFirstInput) - will show the structure of your table
summary(myFirstInput) - if the values are numbers, will calculate the quantiles and mean of each column
'data.frame': 8 obs. of 7 variables:
$ X : chr "A" "B" "C" "D" ...
$ Control_1: num 0.1229 0.0907 0.1068 0.0161 0.1533 ...
$ Control_2: num 0.827 0.11 0.469 0.359 0.462 ...
$ Control_3: num 0.1486 0.3352 0.2419 0.0933 0.9419 ...
$ Treated_1: num 0.086 0.1403 0.1132 0.0271 0.3284 ...
$ Treated_2: num 0.3145 0.2441 0.2793 0.0352 0.0909 ...
$ Treated_3: num 0.22 0.433 0.326 0.107 0.316 ...
X Control_1 Control_2 Control_3
Length:8 Min. :0.01612 Min. :0.1098 Min. :0.09331
Class :character 1st Qu.:0.10277 1st Qu.:0.3299 1st Qu.:0.21857
Mode :character Median :0.13809 Median :0.4651 Median :0.50057
Mean :0.26865 Mean :0.4395 Mean :0.51666
3rd Qu.:0.26532 3rd Qu.:0.5212 3rd Qu.:0.84831
Max. :0.98105 Max. :0.8274 Max. :0.94190
Treated_1 Treated_2 Treated_3
Min. :0.02714 Min. :0.03516 Min. :0.1069
1st Qu.:0.10640 1st Qu.:0.20582 1st Qu.:0.2921
Median :0.23437 Median :0.29687 Median :0.3798
Mean :0.31319 Mean :0.38945 Mean :0.4565
3rd Qu.:0.46944 3rd Qu.:0.52156 3rd Qu.:0.5411
Max. :0.82128 Max. :0.98047 Max. :0.9785
If you enter text in a column that has numeric data, the whole column will be treated as text (character).
summary() acts as in the previous slide.
Use R built-in data to see the difference between characters and factors
LETTERS and letters are vectors (a vector is a sequence of values of the same type) of upper- and lower-case letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
[1] a b c d e f g h i j k l m n o p q r s t u v w x y z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
as.numeric(X) is a function that transforms the data in X into numbers, if it is possible.
as.factor() is a function that transforms a vector into a factor.
NA means “Not Available”: R does not know what to do with characters when you want numbers from them, so the result is “not available”.
NaN means “Not A Number”; Inf means “infinite”.
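The conversions and special values above can be seen in a few lines. The text values are invented for this sketch. R prints a warning when a conversion produces NA, and suppressWarnings() is used here only to keep the output short.

```r
# Character vector turned into a factor
lowerCaseLetters <- letters[1:3]
letterFactor <- as.factor(lowerCaseLetters)
levels(letterFactor)

# Text turned into numbers: "three" cannot be converted, so it becomes NA
numbersAsText <- c("1.5", "2", "three")
convertedNumbers <- suppressWarnings(as.numeric(numbersAsText))
convertedNumbers

# Special numeric results
notANumber <- 0 / 0     # NaN
infiniteValue <- 1 / 0  # Inf
```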
R has datasets already loaded for classes like these.
Check what the PlantGrowth dataset looks like
head(PlantGrowth)
summary(PlantGrowth)
Save the first 6 rows of PlantGrowth into your environment with a new name
myPlantGrowth = head(PlantGrowth)
Explain your code with comments
Inside R (as well as Unix shells and Python), anything you write after a # in a line is not read by the computer. You can use this to explain your code in your own words so you and anyone reading your code can understand it.
To select a row, you can use its name if you indicated the column containing the row names when you loaded your data.
If you did not set the row names, just use the number of the row.
HOW TO
table_name[ row_name , ]
The row name/number HAS TO come BEFORE the ","
myPlantGrowth = head(PlantGrowth)
# Use the name of the table, and [], inside put the number of the row followed by ","
myPlantGrowth[1,] # prints the first row
weight group
1 4.17 ctrl
rownames(myPlantGrowth) # prints the current row names
[1] "1" "2" "3" "4" "5" "6"
rownames(myPlantGrowth) = 6:1 # sets the row names to a sequence starting from 6 and ending in 1
rownames(myPlantGrowth)
[1] "6" "5" "4" "3" "2" "1"
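Once the row names are reversed, a name in quotes and a position without quotes no longer point to the same row. The sketch below illustrates this using the same reversed row names on the first rows of PlantGrowth.

```r
myPlantGrowth <- head(PlantGrowth)
rownames(myPlantGrowth) <- 6:1  # row names are now "6" "5" "4" "3" "2" "1"

# In quotes: the row NAMED "6", which is the first row of the table
rowByName <- myPlantGrowth["6", ]

# Without quotes: the row in POSITION 6, which is now named "1"
rowByPosition <- myPlantGrowth[6, ]

rowByName
rowByPosition
```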
To select a column in a table in R, you cannot click it as in Excel, but you can call it by its name or position in the table.
All the commands below select the column “weight” in the data frame “myPlantGrowth”
# Use this to return a table with a single column
myPlantGrowth = head(PlantGrowth)
myPlantGrowth[1] # same with the column name: myPlantGrowth["weight"]
weight
1 4.17
2 5.58
3 5.18
4 6.11
5 4.50
6 4.61
# Use this to return just the values of the column (this structure is called a vector)
myPlantGrowth$weight
[1] 4.17 5.58 5.18 6.11 4.50 4.61
[1] 4.17 5.58 5.18 6.11 4.50 4.61
Always remember the position of your commas!
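The comma decides whether you are asking for rows or columns. Here is a short sketch of the common forms, using the first rows of PlantGrowth. The result names are chosen for this example only.

```r
myPlantGrowth <- head(PlantGrowth)

# No comma: returns a table (data frame) with a single column
oneColumnTable <- myPlantGrowth["weight"]

# Comma BEFORE the column: returns just the values (a vector)
weightValues <- myPlantGrowth[, "weight"]

# Double brackets also return just the values
sameValues <- myPlantGrowth[["weight"]]

# Comma AFTER the number: returns a row, not a column
firstRow <- myPlantGrowth[1, ]
```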
Use the function names() to check or set the names of your columns
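A minimal sketch of both uses of names(), on the first rows of PlantGrowth. The new column names are hypothetical and chosen only to show the syntax.

```r
myPlantGrowth <- head(PlantGrowth)

# Check the current column names
originalNames <- names(myPlantGrowth)
originalNames

# Set all column names at once (hypothetical new names)
names(myPlantGrowth) <- c("dryWeight", "treatment")

# Change only the second column name
names(myPlantGrowth)[2] <- "treatmentGroup"
names(myPlantGrowth)
```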
Let’s use the dataset iris that is inside R:
Check the structure of the dataset.
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Select one numerical column and make a boxplot, for example Sepal.Length.
In the formula, ~ makes R separate the first item according to the categories in the second.
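One way to write this boxplot is sketched below: Sepal.Length goes on the left of ~ and the grouping column Species on the right. The axis labels are an optional addition, and storing the result in boxplotStats is only done here to inspect what was drawn.

```r
# One box of Sepal.Length per category of Species
boxplotStats <- boxplot(Sepal.Length ~ Species, data = iris,
                        xlab = "Species", ylab = "Sepal length")

# boxplot() also returns the numbers behind the plot
boxplotStats$names  # the groups that were drawn
boxplotStats$n      # number of observations in each group
```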
Using iris, make a plot with the function plot, example: plot(column1, column2)
pairs(iris)
Using your data
Basics:
Function | File extension | Column separation | Decimal separation |
---|---|---|---|
read.csv | .csv | sep = "," | dec = "." |
read.csv2 | .csv | sep = ";" | dec = "," |
read.delim | .txt, .tsv | sep = "\t" (tab) | dec = "." |
read.delim2 | .txt, .tsv | sep = "\t" (tab) | dec = "," |
If the column separation or the decimal separation is not the one expected by the function you choose, you can specify the correct one inside the function with:
sep = ' '
dec = '.'
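The table above implies that the four functions differ only in their default sep and dec. As a sketch under that reading, the made-up file below uses ; between columns and , as the decimal mark, and is read once with read.csv2() and once with read.delim() plus explicit sep and dec.

```r
# Made-up file: columns separated by ";" and decimals written with ","
tmpFile <- tempfile(fileext = ".csv")
writeLines(c("sample;od600",
             "A;0,52",
             "B;1,25"), tmpFile)

# read.csv2() expects exactly this layout
withCsv2 <- read.csv2(tmpFile)

# read.delim() expects tabs and ".", so both are specified by hand
withDelim <- read.delim(tmpFile, sep = ";", dec = ",")

withCsv2
all.equal(withCsv2, withDelim)
```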