How to Create a Heatmap
Downloading Data
For this session we will use a dataset from https://encode.project.org/.
You can download the txt file here:
Installing Packages and Loading Libraries
The first step is to install the required packages from CRAN, Bioconductor, or GitHub and then load the libraries.
library(plyr) # load plyr before dplyr so that dplyr's mutate(), summarise() and arrange() are not masked
library(data.table)
library(dplyr)
library(tidyverse)
library(scales)
library(tidyquant)
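The installation step itself can be sketched as follows. This is a minimal example, assuming all six packages are taken from CRAN; `install_if_missing` is a hypothetical helper name introduced here for illustration and is not part of any package.

```r
# Hypothetical helper: install only the packages that are not yet available.
# `installer` defaults to install.packages (CRAN); it is a parameter so that
# a different installer can be swapped in.
install_if_missing <- function(pkgs, installer = utils::install.packages) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!is_installed]
  if (length(missing_pkgs) > 0) {
    installer(missing_pkgs)
  }
  invisible(missing_pkgs)
}

# Packages loaded in this tutorial
pkgs <- c("plyr", "data.table", "dplyr", "tidyverse", "scales", "tidyquant")

# Uncomment to install whatever is missing:
# install_if_missing(pkgs)
```

For a Bioconductor package the usual installer is `BiocManager::install()`, and for a GitHub repository `remotes::install_github()`; neither is needed for the packages listed here.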
Importing Data
Set your working directory to the folder where you saved the data set, then import the data with the fread function from the data.table package.
my_data <- fread(filename) # filename is the path to the downloaded txt file
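To see what fread returns for a tab-separated file without a header row, here is a small sketch. The three rows reuse coordinates shown later in this tutorial, but the numeric prefixes on the labels and the file layout itself are assumptions made for illustration.

```r
library(data.table)

# Hypothetical stand-in for the downloaded file: tab-separated, no header row
toy_text <- paste(
  "chr1\t10000\t10600\t15_Repetitive/CNV",
  "chr1\t10600\t11137\t13_Heterochrom/lo",
  "chr1\t11137\t11737\t8_Insulator",
  sep = "\n"
)

# fread detects the separator and the column types on its own
toy <- fread(text = toy_text, header = FALSE)

names(toy)          # default column names V1, V2, ... when there is no header
sapply(toy, class)  # column types guessed by fread
```

Because the columns arrive with placeholder names, the renaming step in the next section is what gives them meaning.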
Data Wrangling
To get a clean data set, we name the columns and drop those that are not needed for the final heatmap.
colnames(my_data)[1:4] <- c("chrom", "start", "stop", "type")
my_data <- my_data[, c(1:4)] # keep only columns 1 to 4
my_data$type <- gsub('[[:digit:]]+_', '', my_data$type) # remove each run of digits followed by an underscore from the type column
glimpse(my_data)
## Rows: 571,339
## Columns: 4
## $ chrom <chr> "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", …
## $ start <int> 10000, 10600, 11137, 11737, 11937, 12137, 14537, 20337, 22137, 2…
## $ stop <int> 10600, 11137, 11737, 11937, 12137, 14537, 20337, 22137, 22937, 2…
## $ type <chr> "Repetitive/CNV", "Heterochrom/lo", "Insulator", "Weak_Txn", "We…
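The effect of the regular expression is easiest to see on a few labels. The sketch below uses made-up raw labels with a numeric prefix, which is an assumption about how the type column looks before cleaning.

```r
# Hypothetical raw labels with a numeric prefix
raw_labels <- c("1_Active_Promoter", "11_Weak_Txn", "15_Repetitive/CNV")

# Pattern used in this tutorial: removes every run of digits followed by "_"
gsub("[[:digit:]]+_", "", raw_labels)

# Anchored variant: removes only a leading numeric prefix
cleaned <- sub("^[[:digit:]]+_", "", raw_labels)
cleaned
```

Both calls give the same result on these labels. The anchored form is the safer choice if a label could contain digits followed by an underscore somewhere other than at the start.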
Splitting the Data Frame
We can now split the data frame by type and chromosome and count the rows (segments) of each type on each chromosome with the ddply function from the plyr package. Note that length(type) returns the number of rows in each group, not a genomic length.
my_data2 <- ddply(my_data, .(type, chrom), summarise, no = length(type))
glimpse(my_data2)
## Rows: 276
## Columns: 3
## $ type <chr> "Active_Promoter", "Active_Promoter", "Active_Promoter", "Active…
## $ chrom <chr> "chr1", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "c…
## $ no <int> 1497, 596, 759, 788, 323, 498, 492, 668, 901, 243, 1085, 1011, 3…
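The same per-group count can also be written with dplyr alone. This is an alternative sketch on a made-up miniature table, not the method used above; `dplyr::count()` is called with its namespace because plyr exports a function named `count` as well.

```r
library(dplyr)

# Hypothetical miniature version of my_data
toy <- tibble(
  chrom = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  type  = c("Insulator", "Weak_Txn", "Weak_Txn", "Insulator", "Insulator")
)

# One row per (type, chrom) combination; the count goes into column `no`
toy_counts <- toy %>% dplyr::count(type, chrom, name = "no")
toy_counts
```

The result has the same three columns (type, chrom, no) as my_data2, so the rest of the workflow would not need to change.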
Pivot_longer & Pivot_wider
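pivot_longer() lengthens a data set by increasing the number of rows and decreasing the number of columns; pivot_wider() does the opposite. Here we first compute, for each type, the proportion of its rows found on each chromosome. We then widen the table so that each chromosome becomes a column, which lets us order the types by their chr19 value, and pivot back to long format, the shape ggplot2 expects for a heatmap.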
my_data_final <- my_data2 %>%
  group_by(type) %>%
  dplyr::mutate(prop = no / sum(no)) %>% # proportion within each type; dplyr:: guards against plyr::mutate(), which ignores groups
  ungroup() %>%
  pivot_wider(
    id_cols = type,
    names_from = chrom,
    values_from = prop
  ) %>%
  arrange(desc(chr19)) %>%
  mutate(type = fct_reorder(type, chr19)) %>%
  pivot_longer(
    cols = -type,
    names_to = "chrom",
    values_to = "prop"
  )
head(my_data_final)
## # A tibble: 6 × 3
## type chrom prop
## <fct> <chr> <dbl>
## 1 Weak_Enhancer chr1 0.0299
## 2 Weak_Enhancer chr10 0.0151
## 3 Weak_Enhancer chr11 0.0148
## 4 Weak_Enhancer chr12 0.0158
## 5 Weak_Enhancer chr13 0.00824
## 6 Weak_Enhancer chr14 0.0106
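A quick way to confirm that the grouped proportions came out as intended is to check that they sum to 1 within each type. The sketch below runs the same steps on made-up counts. One caveat: when plyr is attached after dplyr, a bare `mutate()` resolves to `plyr::mutate()`, which ignores `group_by()` and divides by the grand total instead, so the calls are namespaced here.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts: two types on two chromosomes
toy <- tibble(
  type  = c("A", "A", "B", "B"),
  chrom = c("chr1", "chr19", "chr1", "chr19"),
  no    = c(30, 10, 5, 15)
)

toy_prop <- toy %>%
  dplyr::group_by(type) %>%
  dplyr::mutate(prop = no / sum(no)) %>% # share of each type's rows per chromosome
  dplyr::ungroup()

# Wide: one row per type, one column per chromosome
toy_wide <- toy_prop %>%
  pivot_wider(id_cols = type, names_from = chrom, values_from = prop)

# Long again: one row per type/chromosome pair
toy_long <- toy_wide %>%
  pivot_longer(cols = -type, names_to = "chrom", values_to = "prop")

# Within each type the proportions should sum to 1
totals <- toy_long %>%
  dplyr::group_by(type) %>%
  dplyr::summarise(total = sum(prop))
totals
```

Running the same final check on my_data_final is a cheap safeguard: if any total differs from 1, the grouping was not applied.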
Creating a Heatmap
Finally, we are ready to generate our first heatmap! One way to do that is with the ggplot2 package.
heatmap <- my_data_final %>%
  ggplot(aes(chrom, type)) +
  geom_tile(aes(fill = prop)) +
  geom_text(aes(label = scales::percent(prop, accuracy = 1)),
            size = 3) +
  scale_fill_gradient(low = "white", high = palette_light()[1]) +
  labs(
    title = "Percentage of each of the types per chromosome",
    x = "Chromosomes",
    y = "Type"
  ) +
  theme_tq() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    plot.title = element_text(face = "bold")
  )
heatmap
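The core of the plot is just geom_tile() on long-format data. The following stripped-down sketch shows that on made-up values, without the tidyquant theme, and also writes the result to a file. The data, the fill colour, the output path, and the plot size are all arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical long-format data: one row per type/chromosome pair
toy <- data.frame(
  type  = rep(c("A", "B"), each = 2),
  chrom = rep(c("chr1", "chr19"), times = 2),
  prop  = c(0.75, 0.25, 0.25, 0.75)
)

p <- ggplot(toy, aes(chrom, type)) +
  geom_tile(aes(fill = prop)) + # one tile per type/chromosome cell
  geom_text(aes(label = scales::percent(prop, accuracy = 1))) +
  scale_fill_gradient(low = "white", high = "steelblue")

# Write the plot to a temporary PDF
out_file <- file.path(tempdir(), "heatmap_example.pdf")
ggsave(out_file, plot = p, width = 4, height = 3)
```

The same ggsave() call should work for the full heatmap object created above by passing it as the `plot` argument.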