R class - basics

Karen Cristine Goncalves, Ph.D.

2023-08-10

Variables

To work with your files, create new data, and so on, you need to save things in R’s memory.

These things are variables.

Every person you know is stored in your brain as a variable (their names)
Every contact in your cellphone is a variable that stores their names, phone number, email, etc.

Variables - save your things in R

How to create a variable in R:

Start with the name you want
Use = or <- to separate the name from the value you want it to store
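
The two steps above can be sketched as below. The variable names and values are made up for this example, they are not part of the class data.

```r
# Both lines create a variable: name first, then <- or =, then the value
sampleCount <- 12       # using <-
bufferVolumeMl = 2.5    # using =

# Type the name of a variable to see what is stored in it
sampleCount

# Variables can be used to create other variables
totalVolumeMl <- sampleCount * bufferVolumeMl
totalVolumeMl
```

Both = and <- give the same result here; pick one and use it consistently in your script.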

Variable names

Give meaningful names to your variables, or else you may not remember what they are when you read the script later
Bad names	Why?	Good names
x, y, z, counts	x, y, z don't tell us what the data there has, so you will not know later what you used	rawCountsRNAseq, fluorescenceTableDay1
bacteriacountsforday5	Could be a good name, but is difficult to read, use upper and lower case or _ to improve readability	bacteriaCountsForDay5
microplate	Does not tell what kind of data is stored	microplateFluorescence, microplateConcentration, microplateOD

Variable names cannot start with a number
Variable names cannot have symbols, eg. *(),\/"';:<>{}[]~!@#$%^+= (only . and _ are allowed)
Variable names cannot have spaces
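
A small sketch of these rules, with invented names. The invalid lines are commented out so the code runs; remove the # in front of one of them to see the error. make.names() is a base R function that is not covered in these slides, it is used here only to show how R would repair a name.

```r
# Valid names: letters, numbers (not at the start), . and _
bacteriaCountsDay5 <- 150
bacteria_counts_day5 <- 150
bacteria.counts.day5 <- 150

# Invalid names (each one gives an error)
# 5dayCounts <- 150        # starts with a number
# bacteria counts <- 150   # has a space
# counts!day5 <- 150       # has a symbol

# make.names() shows how R would change a name to make it valid
make.names(c("5dayCounts", "bacteria counts"))
```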

Functions - work with your data

You use R to do something with your data:

Calculate stats, eg. mean, standard deviation
Know if two or more groups are different with a t-test or ANOVA.
You do these things using functions.

They are simply a bunch of code stored in a variable with a specific name.
You use them by typing the name followed by ().
Most of the time, you need to put something inside the (), eg. a table, a column in a table, several numbers, etc.

If you do not know how to use a function, ask for help (which is also a function!):

help("mean") or ?mean will open a page in your browser (if you are using R) or the tab “Help” (if you are using RStudio) with an explanation of what the function does and how to use it
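
As a quick illustration of calling functions, here is a sketch with made-up numbers. c(), sd() and round() are base R functions; the variable name is only an example.

```r
# c() is a function that combines several values into one vector
plantHeightsCm <- c(10.2, 11.5, 9.8, 12.1)

mean(plantHeightsCm)   # average
sd(plantHeightsCm)     # standard deviation

# The result of one function can go inside another function
# digits = 1 is an argument that asks for 1 decimal place
round(mean(plantHeightsCm), digits = 1)
```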

Functions - examples

You load files into R using functions: read.delim(), read.csv(), read_csv() (from the readr package, which has to be installed and loaded first; it is different from read.csv())

Eg. myFile.txt is in the “Inputs” folder of my project, it is a table with column names (headers) and row names, the columns are separated by tabs ("\t")
myFile <- read.delim(file = "Inputs/myFile.txt", header = T, row.names = 1)
1. "Inputs/myFile.txt" = where the file is stored, has to be inside single or double quotes
2. header = T and header = TRUE mean the same thing; both must always be written in capital letters
3. row.names = 1 tells R which column has the information that identifies each row as a different thing, you can use any column here, or you can omit this
- Each of the three things above (arguments) is separated from the others inside the function by , (try writing a function without commas and see what happens)

Column separation

read.delim() was made to read tables with columns separated by tabs

If the columns were separated by something else, like ;, you would write:
myFile <- read.delim(file = "Inputs/myFile.txt", header = T, row.names = 1, sep = ";")
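
To see what sep changes, here is a sketch that writes a tiny ;-separated file to a temporary folder and reads it twice. The file content and the variable names are invented for this example; tempfile() and writeLines() are base R functions used only to create the file.

```r
# Create a small file with columns separated by ;
exampleFile <- tempfile(fileext = ".txt")
writeLines(c("gene;control;treated",
             "geneA;1.2;3.4",
             "geneB;5.6;7.8"),
           exampleFile)

# Without sep = ";" R looks for tabs, finds none, and puts everything in one column
wrongTable <- read.delim(exampleFile, header = TRUE)
ncol(wrongTable)

# With sep = ";" the columns are separated; row.names = 1 uses "gene" as row names
rightTable <- read.delim(exampleFile, header = TRUE, row.names = 1, sep = ";")
rightTable
```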

Exercise - load table into R

Load your dataset (from the preparatory slides To do before the first class, Class1_exampleData.txt) and save it to the variable "myFirstInput"

Note the columns are separated by tabs
You do not need to download a file to read it. You can either:
- Use the link in place of the file path: "https://karengoncalves.github.io/Programming_classes/exampleData/Class1_exampleData.txt" or
- Download it to your project’s “Inputs” folder and use the path "Inputs/Class1_exampleData.txt"

filepath = "https://karengoncalves.github.io/Programming_classes/exampleData/Class1_exampleData.txt"
myFirstInput = read.delim(filepath, header = T) # row names are not set here, so the first column stays in the table as X

Functions to check your table

If you want to see that the file is okay, you can check if the beginning and end of the table looks right:

head(myFirstInput) - will print the first 6 rows of a table.
tail(myFirstInput) - will print the last 6 rows of a table.
- For both, you can change the number of lines printed by adding n=x, where x is the number of lines you want to see

head(myFirstInput, n = 5)

  X  Control_1 Control_2 Control_3  Treated_1  Treated_2 Treated_3
1 A 0.12292920 0.8274256 0.1485878 0.08604300 0.31445400 0.2195038
2 B 0.09068010 0.1097931 0.3352014 0.14033150 0.24412490 0.4332501
3 C 0.10680465 0.4686093 0.2418946 0.11318725 0.27928945 0.3263769
4 D 0.01612455 0.3588163 0.0933068 0.02714425 0.03516455 0.1068732
5 E 0.15325396 0.4616864 0.9418977 0.32840471 0.09089286 0.3163602
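
The same checks work on any table. The sketch below uses PlantGrowth, one of the datasets that come with R, so it runs without any file. nrow(), ncol() and dim() are base R functions that are not in the slides above; they are added here as a quick way to check the size of a table.

```r
tail(PlantGrowth, n = 3)   # last 3 rows

nrow(PlantGrowth)          # number of rows
ncol(PlantGrowth)          # number of columns
dim(PlantGrowth)           # both at once: rows first, then columns
```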

Functions to check your table

str(myFirstInput) - will show the structure of your table
- Its class (data.frame or tibble are types of tables)
- The number of rows (observations) and columns (variables)
- The type of data inside each column and their first few values
summary(myFirstInput) - if the values are numbers, will calculate the minimum, quartiles, mean and maximum of each column

str(myFirstInput)

'data.frame':   8 obs. of  7 variables:
 $ X        : chr  "A" "B" "C" "D" ...
 $ Control_1: num  0.1229 0.0907 0.1068 0.0161 0.1533 ...
 $ Control_2: num  0.827 0.11 0.469 0.359 0.462 ...
 $ Control_3: num  0.1486 0.3352 0.2419 0.0933 0.9419 ...
 $ Treated_1: num  0.086 0.1403 0.1132 0.0271 0.3284 ...
 $ Treated_2: num  0.3145 0.2441 0.2793 0.0352 0.0909 ...
 $ Treated_3: num  0.22 0.433 0.326 0.107 0.316 ...

summary(myFirstInput)

      X               Control_1         Control_2        Control_3      
 Length:8           Min.   :0.01612   Min.   :0.1098   Min.   :0.09331  
 Class :character   1st Qu.:0.10277   1st Qu.:0.3299   1st Qu.:0.21857  
 Mode  :character   Median :0.13809   Median :0.4651   Median :0.50057  
                    Mean   :0.26865   Mean   :0.4395   Mean   :0.51666  
                    3rd Qu.:0.26532   3rd Qu.:0.5212   3rd Qu.:0.84831  
                    Max.   :0.98105   Max.   :0.8274   Max.   :0.94190  
   Treated_1         Treated_2         Treated_3     
 Min.   :0.02714   Min.   :0.03516   Min.   :0.1069  
 1st Qu.:0.10640   1st Qu.:0.20582   1st Qu.:0.2921  
 Median :0.23437   Median :0.29687   Median :0.3798  
 Mean   :0.31319   Mean   :0.38945   Mean   :0.4565  
 3rd Qu.:0.46944   3rd Qu.:0.52156   3rd Qu.:0.5411  
 Max.   :0.82128   Max.   :0.98047   Max.   :0.9785

Know the types of data

If you enter a text in a column that has numeric data, the column will be treated as text (character)

With numeric columns, summary() acts as in the previous slide
With text columns, there are 2 options:
- character: will print the length, class and mode of the column (the mode here is how R stores the data, "character", not the most common value)
- factor (categories): will print each category and its frequency
Characters such as letters cannot be transformed into numbers, factors (categories) can!
- That is because factors have levels (category 1, category 2, …), and to save space, R simply remembers the level of each line
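
A minimal sketch of these points, with made-up values. class() is a base R function that tells you the type of data stored in a variable.

```r
# One text entry ("missing") turns all the values into text
odValues <- c(0.12, 0.35, "missing", 0.50)
class(odValues)

# The same text stored as character and as factor
treatment <- c("ctrl", "drug", "drug", "ctrl", "drug")
summary(treatment)              # length, class and mode
summary(as.factor(treatment))   # each category and how many times it appears
```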

Test - difference between text and categories

Use R built-in data to see the difference between characters and factors

LETTERS and letters are vectors (a list of values of the same type) of upper/lower case letters

letters

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

(myLetters = as.factor(letters))

 [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

as.numeric(myLetters)

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26

as.numeric(letters)

Warning: NAs introduced by coercion

 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA

Functions
- as.numeric(X) is a function that transforms the data in X into numbers, if it is possible.
- as.factor is a function that transforms the vector into a factor
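
One thing to watch for when combining the two functions: if a factor holds numbers written as text, as.numeric() returns the number of the level, not the value you see. The sketch below uses invented doses; going through as.character() first is a common way to get the original values back.

```r
doses <- as.factor(c("10", "50", "10"))
levels(doses)                     # the categories: "10" "50"

as.numeric(doses)                 # level numbers: 1 2 1
as.numeric(as.character(doses))   # original values: 10 50 10
```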

NA means “Not Available”: R does not know what to do with characters when you want numbers from them, so the result is “not available”

NaN means “Not A Number”; Inf means “infinite”
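
A short sketch of where these values show up, using made-up measurements. na.rm = TRUE and is.na() are part of base R but are not covered in the slides; they are shown here as one possible way to deal with missing values.

```r
0 / 0    # NaN: the result is not a number
1 / 0    # Inf: infinite

# One NA in a calculation makes the result NA
measurements <- c(1.5, NA, 2.5)
mean(measurements)

# na.rm = TRUE asks the function to remove the NA before calculating
mean(measurements, na.rm = TRUE)

# is.na() shows which values are missing
is.na(measurements)
```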


Exercise - playing with datasets already in R

R has datasets already loaded for classes like these.

Check what the PlantGrowth dataset looks like
- head(PlantGrowth)
- summary(PlantGrowth)
Save the first rows of PlantGrowth into your environment with a new name
- myPlantGrowth = head(PlantGrowth)

Explain your code with comments

Inside R (as well as in Unix shells and Python), anything you write after a # in a line is not read by the computer. You can use this to explain your code in your own words so you and anyone reading your code can understand it.
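
For example (the values and the variable name are made up):

```r
# A whole line can be a comment
plantWeights <- c(4.17, 5.58, 5.18)   # a comment can also come after the code

# The next line is not run, because it starts with #
# plantWeights <- c(0, 0, 0)

mean(plantWeights)                    # average weight of the three plants
```

Putting a # in front of a line is also a quick way to turn it off without deleting it.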

Row selection

If you load your data and indicate the column containing the names of the rows, you can use the name of the row

If you did not set the row names, just use the number of the row.

HOW TO

table_name[ row_name , ]

The row name/number HAS TO come BEFORE the ","

myPlantGrowth = head(PlantGrowth)

# Use the name of the table, and [], inside put the number of the row followed by ","
myPlantGrowth[1,] # prints the first row

  weight group
1   4.17  ctrl

# See and set rownames with the function rownames
rownames(myPlantGrowth)

[1] "1" "2" "3" "4" "5" "6"

rownames(myPlantGrowth) = 6:1 # creates a sequence starting from 6 and ending in 1
rownames(myPlantGrowth)

[1] "6" "5" "4" "3" "2" "1"
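
Selecting rows by name works the same way as by number. In the sketch below, the row names (plantA to plantF) are invented for the example.

```r
myPlantGrowth = head(PlantGrowth)

# Give each of the 6 rows a name
rownames(myPlantGrowth) = c("plantA", "plantB", "plantC", "plantD", "plantE", "plantF")

myPlantGrowth["plantB", ]                # one row, by name
myPlantGrowth[c("plantA", "plantF"), ]   # several rows, by name
myPlantGrowth[2:4, ]                     # rows 2 to 4, by number
```

Row names go inside quotes, row numbers do not.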

Column selection

To select a column in a table in R, you cannot click it as in excel, but you can call it by its name or position in the table.

All the commands below select the column “weight” in the data frame “myPlantGrowth”

# Use this to return a table with a single column
myPlantGrowth = head(PlantGrowth)
myPlantGrowth[1] # same with the column name: myPlantGrowth["weight"]

# Use this to return just the values of the column (this structure is called a vector)
myPlantGrowth$weight

[1] 4.17 5.58 5.18 6.11 4.50 4.61

myPlantGrowth[, "weight"] # same with the column number: myPlantGrowth[, 1]

[1] 4.17 5.58 5.18 6.11 4.50 4.61

Always remember the position of your commas!

Use the function names() to check or set the names of your columns
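
A small sketch of checking and setting column names. The new names (dryWeight, treatment) are invented for this example.

```r
myPlantGrowth = head(PlantGrowth)
names(myPlantGrowth)      # check the names: "weight" "group"

# Set new names, one for each column, in the same order as the columns
names(myPlantGrowth) = c("dryWeight", "treatment")
names(myPlantGrowth)

myPlantGrowth$dryWeight   # the column is now called by its new name
```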

See how much you’ve learned!

Let’s use the dataset iris that is inside R:

Check the structure of the dataset.

str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Your first graph(s)

Select one numerical column and make a boxplot, example - Sepal.Length

~ tells R to split the first item (the values) according to the categories in the second

Basic boxplot

boxplot(iris$Sepal.Length ~ iris$Species)

With colors

# Let's give it some color
colors = c("red", "green", "blue")
boxplot(iris$Sepal.Length ~ iris$Species,
    col = colors)

With clean axes titles

colors = c("red", "green", "blue")
# If the column names are fine, we can use the function with() to remove the table name from the axis titles
with(iris, boxplot(Sepal.Length ~ Species, col = colors))

With new axes titles

# Set new axis titles with xlab and ylab
colors = c("red", "green", "blue")
boxplot(iris$Sepal.Length ~ iris$Species, col = colors,
    ylab = "Sepal length (cm)", xlab = "Species epithet")
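
The ~ is not only for plots. As a sketch of what "values by category" means, the base R function aggregate() (not covered in this class) can use the same notation to calculate one number per category, here the mean sepal length of each species. The variable name is only an example.

```r
# Same notation as in the boxplot: values ~ categories
meanSepalLength <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
meanSepalLength
```

Each row of the result corresponds to one box in the boxplot.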

Homework

Using iris

Select 2 columns and plot them with plot, example: plot(column1, column2)
Plot everything against everything: pairs(iris)

Using your data

Create a boxplot with colored boxes, meaningful axes titles and a plot title

Learning more - Basic R plots

Cheat-sheet for loading files

Basics:

The functions below assume your data has column names (header = TRUE by default)

Function	File extension	Column separation	Decimal separation
read.csv	.csv	sep = ","	dec = "."
read.csv2	.csv	sep = ";"	dec = ","
read.delim	.txt, .tsv	sep = "\t" (tab)	dec = "."
read.delim2	.txt, .tsv	sep = "\t" (tab)	dec = ","

If the column separation or the decimal separation is not the one expected by the function you choose, you can specify the correct one inside the function with:

sep = ' '
dec = '.'
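
As a sketch of the table above, the code below creates a small file that uses ; between columns and , as the decimal mark, then reads it in two ways. The file content and variable names are invented; tempfile() and writeLines() are base R functions used only to create the file.

```r
# A file with ; between columns and , for decimals
europeanFile <- tempfile(fileext = ".csv")
writeLines(c("sample;concentration",
             "s1;1,5",
             "s2;2,25"),
           europeanFile)

# read.csv2() already expects ; and ,
withCsv2 <- read.csv2(europeanFile)

# read.delim() expects tabs and . so we tell it what to use instead
withDelim <- read.delim(europeanFile, sep = ";", dec = ",")

str(withCsv2)     # concentration is read as numbers
str(withDelim)    # same result
```

If dec is wrong, the numbers are read as text, so str() is a quick way to check that the file was loaded as expected.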