How to read multiple csv files
Downloading Data
For this session we will use a dataset from the ENCODE project, https://encode.project.org/.
You can download the txt file here:
Installing Packages and Loading Libraries
The first step is to install the packages, from CRAN, Bioconductor or GitHub, and then load the libraries.
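library(data.table) # fread()
library(plyr)       # loaded before dplyr so that it does not mask dplyr verbs
library(dplyr)
library(tidyverse)  # purrr::map() and readr::write_csv()
library(fs)         # dir_create() and dir_ls()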
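If some of these packages are not installed yet, a small helper can find the missing ones before we call library(). This is only a minimal sketch: the helper name missing_packages is hypothetical, and it assumes every package in the list is available from CRAN.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

needed <- c("data.table", "plyr", "dplyr", "tidyverse", "fs")
to_install <- missing_packages(needed)

# Install only what is missing, and only in an interactive session
# (assumes a CRAN mirror is configured).
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Packages hosted on Bioconductor or GitHub need their own installers, so this sketch does not cover them.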
Importing Data
Set your working directory to the folder where you saved the data set. We can then import the data set with the fread command from data.table.
my_data <- fread(filename)
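To see what fread returns for a file without a header row, here is a small sketch on a temporary file. The two rows are made up and only mimic the layout we expect (tab-separated, no header); toy_file and toy_data are hypothetical names, not part of the original workflow.

```r
library(data.table)

# Made-up stand-in for the downloaded file: two tab-separated rows, no header.
toy_file <- tempfile(fileext = ".txt")
writeLines(
  c("chr1\t10000\t10600\t15_Repetitive/CNV",
    "chr1\t10600\t11137\t13_Heterochrom/lo"),
  toy_file
)

# State the separator and header explicitly instead of relying on auto-detection.
toy_data <- fread(toy_file, sep = "\t", header = FALSE)

# Without a header, fread names the columns V1, V2, ...
names(toy_data)
```

This is why the next step assigns proper column names.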
Data wrangling
To get a clean data set, we need to add column names and remove the columns that are not needed for our final heatmap output.
colnames(my_data)[1:4] <- c("chrom","start","stop","type")
my_data <- my_data[,c(1:4)] #Keeping only the columns 1 to 4
my_data$type <- gsub('[[:digit:]]+_','',my_data$type) #removing the numbers and underscores from our type column
glimpse(my_data)
## Rows: 571,339
## Columns: 4
## $ chrom <chr> "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", …
## $ start <int> 10000, 10600, 11137, 11737, 11937, 12137, 14537, 20337, 22137, 2…
## $ stop <int> 10600, 11137, 11737, 11937, 12137, 14537, 20337, 22137, 22937, 2…
## $ type <chr> "Repetitive/CNV", "Heterochrom/lo", "Insulator", "Weak_Txn", "We…
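The gsub pattern used above is easier to follow on a few strings. The labels below are made up in the "number_name" form the pattern targets; the last one is a deliberately awkward, hypothetical label that shows the difference between gsub and an anchored sub.

```r
# Made-up labels in the "<number>_<name>" form.
labels <- c("15_Repetitive/CNV", "13_Heterochrom/lo", "11_Weak_Txn")

# gsub() removes every run of digits that is followed by an underscore.
cleaned <- gsub("[[:digit:]]+_", "", labels)
cleaned
#> [1] "Repetitive/CNV" "Heterochrom/lo" "Weak_Txn"

# A hypothetical label with digits before a second underscore.
tricky <- "5_H3K27me3_rich"

gsub("[[:digit:]]+_", "", tricky)   # also removes the inner "3_"
#> [1] "H3K27merich"

sub("^[[:digit:]]+_", "", tricky)   # anchored: strips only the leading prefix
#> [1] "H3K27me3_rich"
```

For the labels in our data set both versions give the same result; the anchored form is just the more defensive choice.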
Creating a new directory
We can now create a new directory where we can save our .csv files.
new_directory <- "~/Documents/MyWebPage/content/blog/example/my_csv_files/"
dir_create(new_directory)
Splitting our data frame
We are now ready to split our data frame into multiple .csv files, one per chromosome, and save them in the my_csv_files directory.
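setwd("~/Documents/MyWebPage/content/blog/example/my_csv_files/")
my_data %>%
  group_by(chrom) %>%
  group_split() %>%
  # walk() is map() for side effects: it writes the files without printing a list
  walk(
    .f = function(data) {
      # one file per chromosome, named after it (e.g. chr1);
      # `file` replaces the deprecated `path` argument (readr >= 1.4.0)
      write_csv(data, file = unique(data$chrom))
    }
  )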
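The same split can be sketched without changing the working directory, by building full paths with file.path. This version swaps in base R (split and write.csv) for the tidyverse pipeline, uses made-up toy data and a temporary folder, and adds a .csv extension to each file name, which the call above does not do.

```r
# Toy data standing in for my_data (values are made up for illustration).
toy <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(10000L, 10600L, 500L),
  stop  = c(10600L, 11137L, 900L),
  type  = c("Insulator", "Weak_Txn", "Insulator")
)

# Temporary output folder, so nothing is written into the project.
out_dir <- tempfile("my_csv_files_")
dir.create(out_dir)

# split() returns a named list with one data frame per chromosome.
pieces <- split(toy, toy$chrom)

# Write each piece to <out_dir>/<chrom>.csv using a full path.
for (chrom in names(pieces)) {
  write.csv(pieces[[chrom]],
            file.path(out_dir, paste0(chrom, ".csv")),
            row.names = FALSE)
}

list.files(out_dir)
```

Writing with full paths means the script behaves the same no matter where it is run from.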
Importing multiple csv files [1]
We can do this in several ways: with the lapply function after generating a list of files, or with the map function. Let’s start with the first way:
setwd("~/Documents/MyWebPage/content/blog/example/my_csv_files/")
fileslist <- list.files(pattern = "") # an empty pattern matches every file in the folder
csvFiles <- lapply(fileslist, function(x) read.table(x, header = TRUE, sep = ","))
csvFiles <- do.call("rbind", csvFiles) # stack the list into one data frame
csvFiles |> head()
## chrom start stop type
## 1 chr1 10000 10600 Repetitive/CNV
## 2 chr1 10600 11137 Heterochrom/lo
## 3 chr1 11137 11737 Insulator
## 4 chr1 11737 11937 Weak_Txn
## 5 chr1 11937 12137 Weak_Enhancer
## 6 chr1 12137 14537 Weak_Txn
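If the folder may also hold other files, list.files can filter by name and return full paths, so setwd is not needed. The sketch below assumes the files carry a .csv extension (the files written earlier in this post do not) and works on a temporary folder with made-up content.

```r
# Temporary folder with two small CSV files plus one unrelated file.
csv_dir <- tempfile("csv_demo_")
dir.create(csv_dir)
write.csv(data.frame(chrom = "chr1", start = 10000L),
          file.path(csv_dir, "chr1.csv"), row.names = FALSE)
write.csv(data.frame(chrom = "chr2", start = 500L),
          file.path(csv_dir, "chr2.csv"), row.names = FALSE)
writeLines("not a table", file.path(csv_dir, "notes.txt"))

# full.names = TRUE returns complete paths;
# the pattern keeps only file names ending in ".csv".
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file and stack the results into one data frame.
combined <- do.call(rbind, lapply(files, read.csv))
combined
```

The notes.txt file is skipped because its name does not match the pattern.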
Importing multiple csv files [2]
The second way is to import the multiple csv files from our directory into a list using the map function.
directory_that_holds_files <- "~/Documents/MyWebPage/content/blog/example/my_csv_files/"
chromosomes_list <- directory_that_holds_files %>%
  dir_ls() %>%
  map(
    .f = function(path) read.table(path, header = TRUE, sep = ",")
  )
Binding Rows
The final step is to use the bind_rows function to combine the list into a single data frame.
chromosomes_tbl <- chromosomes_list %>%
set_names(dir_ls(directory_that_holds_files)) %>%
bind_rows(.id = "file_path")
head(chromosomes_tbl)
## file_path
## 1 /Users/andreasvenizelos/Documents/MyWebPage/content/blog/example/my_csv_files/chr1
## 2 /Users/andreasvenizelos/Documents/MyWebPage/content/blog/example/my_csv_files/chr1
## 3 /Users/andreasvenizelos/Documents/MyWebPage/content/blog/example/my_csv_files/chr1
## 4 /Users/andreasvenizelos/Documents/MyWebPage/content/blog/example/my_csv_files/chr1
## 5 /Users/andreasvenizelos/Documents/MyWebPage/content/blog/example/my_csv_files/chr1
## 6 /Users/andreasvenizelos/Documents/MyWebPage/content/blog/example/my_csv_files/chr1
## chrom start stop type
## 1 chr1 10000 10600 Repetitive/CNV
## 2 chr1 10600 11137 Heterochrom/lo
## 3 chr1 11137 11737 Insulator
## 4 chr1 11737 11937 Weak_Txn
## 5 chr1 11937 12137 Weak_Enhancer
## 6 chr1 12137 14537 Weak_Txn
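The file_path column above holds the full path of every file, which gets long. One possible variation, sketched here on a made-up list, is to shorten the list names with basename before binding. The names toy_list, toy_tbl and file_name are hypothetical.

```r
library(dplyr)

# Toy list standing in for chromosomes_list; the names mimic full file paths.
toy_list <- list(
  "/some/folder/chr1" = data.frame(chrom = "chr1", start = 10000L),
  "/some/folder/chr2" = data.frame(chrom = "chr2", start = 500L)
)

# Keep only the last component of each path.
names(toy_list) <- basename(names(toy_list))

# .id stores the list names in a new first column.
toy_tbl <- bind_rows(toy_list, .id = "file_name")
toy_tbl
```

The id column then reads chr1, chr2, and so on, which is usually enough to trace each row back to its source file.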