Basic Data Wrangling

Downloading Data

One of the most important lesson in R is filtering and cleaning up you data, in order to generate an easy to read and to handle dataset. First of all we need to import a dataset. There are many sources for publicly avaliable big data such as: https://genome.ucsc.edu/ , https://www.ensembl.org/, https://www.encodeproject.org/. I have already downloaded a dataset for you. You can download the txt file here:

Install Packages and Loading Libraries

First step is to install packages either from CRAN, Bioconductor or from Github and to load the libraries.

library(data.table)
library(dplyr)
library(tidyverse)
library(tm)

Importing Data

Now we will go ahead to import our dataset. There are many ways to import a dataset depending on the file type and the size of the file. First step is to navigate your console to your own directory with your folder containing the data below. Here I am going to show you two ways of importing a dataset with the latter one to be preferred for larger datasets.

Method 1

filename <- "Encode_HMM_data.txt"
my_data <- read.csv(filename, sep="\t", header=FALSE)

Method 2

library(data.table)
my_data <- fread(filename)
head(my_data)
##      V1    V2    V3                V4 V5 V6    V7    V8          V9
## 1: chr1 10000 10600 15_Repetitive/CNV  0  . 10000 10600 245,245,245
## 2: chr1 10600 11137 13_Heterochrom/lo  0  . 10600 11137 245,245,245
## 3: chr1 11137 11737       8_Insulator  0  . 11137 11737  10,190,254
## 4: chr1 11737 11937       11_Weak_Txn  0  . 11737 11937 153,255,102
## 5: chr1 11937 12137   7_Weak_Enhancer  0  . 11937 12137   255,252,4
## 6: chr1 12137 14537       11_Weak_Txn  0  . 12137 14537 153,255,102
glimpse(my_data)
## Rows: 571,339
## Columns: 9
## $ V1 <chr> "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "ch…
## $ V2 <int> 10000, 10600, 11137, 11737, 11937, 12137, 14537, 20337, 22137, 2293…
## $ V3 <int> 10600, 11137, 11737, 11937, 12137, 14537, 20337, 22137, 22937, 2693…
## $ V4 <chr> "15_Repetitive/CNV", "13_Heterochrom/lo", "8_Insulator", "11_Weak_T…
## $ V5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ V6 <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".…
## $ V7 <int> 10000, 10600, 11137, 11737, 11937, 12137, 14537, 20337, 22137, 2293…
## $ V8 <int> 10600, 11137, 11737, 11937, 12137, 14537, 20337, 22137, 22937, 2693…
## $ V9 <chr> "245,245,245", "245,245,245", "10,190,254", "153,255,102", "255,252…

Select Columns

We can see that succesfully we have imported our dataset in R studio. It’s a dataframe with 571,339 rows and 9 columns. We will go ahead and we will select the first 4 columns from our data set and to ignore the rest of the columns. Again there are many ways of doing that. Here I can show you 2 ways :

Method 1

my_data %>% select(V1,V2,V3,V4)
##           V1        V2        V3                V4
##      1: chr1     10000     10600 15_Repetitive/CNV
##      2: chr1     10600     11137 13_Heterochrom/lo
##      3: chr1     11137     11737       8_Insulator
##      4: chr1     11737     11937       11_Weak_Txn
##      5: chr1     11937     12137   7_Weak_Enhancer
##     ---                                           
## 571335: chrX 155251806 155255406 10_Txn_Elongation
## 571336: chrX 155255406 155257806       11_Weak_Txn
## 571337: chrX 155257806 155258806       8_Insulator
## 571338: chrX 155258806 155259606 13_Heterochrom/lo
## 571339: chrX 155259606 155260406 15_Repetitive/CNV

Method 2

my_data[,1:4]
##           V1        V2        V3                V4
##      1: chr1     10000     10600 15_Repetitive/CNV
##      2: chr1     10600     11137 13_Heterochrom/lo
##      3: chr1     11137     11737       8_Insulator
##      4: chr1     11737     11937       11_Weak_Txn
##      5: chr1     11937     12137   7_Weak_Enhancer
##     ---                                           
## 571335: chrX 155251806 155255406 10_Txn_Elongation
## 571336: chrX 155255406 155257806       11_Weak_Txn
## 571337: chrX 155257806 155258806       8_Insulator
## 571338: chrX 155258806 155259606 13_Heterochrom/lo
## 571339: chrX 155259606 155260406 15_Repetitive/CNV
sorted_data <- my_data[,1:4]

Adding Column Names

Now we have selected successfully the columns that we need. However, we need to rename the columns in order to have a more clean data set.

colnames(sorted_data)[1:4] <- c("Chromosome","Start","Stop","Type")
head(sorted_data)
##    Chromosome Start  Stop              Type
## 1:       chr1 10000 10600 15_Repetitive/CNV
## 2:       chr1 10600 11137 13_Heterochrom/lo
## 3:       chr1 11137 11737       8_Insulator
## 4:       chr1 11737 11937       11_Weak_Txn
## 5:       chr1 11937 12137   7_Weak_Enhancer
## 6:       chr1 12137 14537       11_Weak_Txn

Final Clean-Up

Now we have a more defined clean data set. However, there is a string including a number and an underscore as prefix in each character value of the column Type. Below again 2 ways to remove this string:

Method 1

sorted_data$Type <- gsub('[[:digit:]]+_','', sorted_data$Type)

Method 2

sorted_data$Type <- str_replace_all(sorted_data$Type, "[:digit:]+_","")
head(sorted_data)
##    Chromosome Start  Stop           Type
## 1:       chr1 10000 10600 Repetitive/CNV
## 2:       chr1 10600 11137 Heterochrom/lo
## 3:       chr1 11137 11737      Insulator
## 4:       chr1 11737 11937       Weak_Txn
## 5:       chr1 11937 12137  Weak_Enhancer
## 6:       chr1 12137 14537       Weak_Txn

And here you have it ! A final clean data set easy to read and work with.

Feel free to download the files from github:

Download FILES

Next