Question:
I'm trying by all means to work with a 5GB file for my monograph.
The code I'm trying to run in R is as follows:
> write.csv(subset(read.csv("enem.csv", header=TRUE), UF_ESC=="SP"), "filtro.csv", row.names=FALSE)
I've already tested it with a sample file and it worked, but not on the original database. I imagine it's because of the size, since it just keeps processing and nothing happens.
If anyone has any other ideas, it would be a huge help.
Answer:
Here are some tips to solve your problem:
1 – Read the file using another function:
> library(microbenchmark)
> microbenchmark(
+   base = read.csv(file = "df-write-csv.csv", header = TRUE),
+   readr = readr::read_csv("df-write-csv.csv"),
+   data.table = data.table::fread("df-write-csv.csv"),
+   rio = rio::import("df-write-csv.csv", format = "csv")
+ )
Unit: microseconds
       expr      min        lq      mean    median        uq      max neval
       base 1836.230 1912.1815 2253.6071 1980.3995 2282.1675 4148.787   100
      readr  823.960  881.3625 1072.4790  921.6605 1120.2365 3538.359   100
 data.table  327.759  364.4810  442.5933  402.3295  458.7895  920.436   100
        rio  312.317  351.2260  444.1087  382.9325  439.7960 2938.490   100
Note that reading the file with the fread function from data.table, or with the import function from rio, is about 4x faster than with R's native function. First, check that you were actually able to read the file, as in the sketch below.
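For example, a minimal sketch applied to your case (the file name enem.csv comes from your question; everything else is an assumption):

> # Assumption: enem.csv is in the working directory, as in your question.
> library(data.table)
> dados <- fread("enem.csv")  # shows a progress bar, so you can see it is working
> dim(dados)                  # confirm the read worked and see how big the data is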
2 – Check that you actually managed to filter your database: save the result of subset in a helper object. If that is where it fails, try filtering with functions from other packages, such as dplyr or data.table. For heavy operations, data.table can be much faster:
> df <- data.frame(x = 1:100000, y = 1:100000, l = sample(letters, size = 100, replace = TRUE))
> microbenchmark(
+   base = subset(df, l == "a"),
+   dplyr = dplyr::filter(df, l == "a"),
+   data.table = data.table::data.table(df)[l == "a", ]
+ )
Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval
       base 10.329514 12.467143 14.962479 13.976907 17.171858  24.3761   100
      dplyr  7.331626  8.624356 10.063947  8.853807 11.140871  16.8939   100
 data.table  2.986519  4.580536  6.774548  4.824227  5.957255 119.9709   100
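Applied to your case, a sketch that stores the result in a helper object so you can check where things go wrong (UF_ESC == "SP" is the filter from your question; the object names are just illustrations):

> # Assumption: dados was read with fread, as in the sketch of item 1.
> filtro <- dados[UF_ESC == "SP"]  # data.table syntax: filter rows inside [ ]
> nrow(filtro)                     # check that the filter actually returned rows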
3 – Use the write_csv function from the readr package; it is roughly 2x faster than R's native write.csv function:
> microbenchmark(
+   base = write.csv(df, file = "df-write-csv.csv", row.names = FALSE),
+   rio = rio::export(df, file = "df-rio.csv", format = "csv"),
+   readr = readr::write_csv(df, "df-readr.csv")
+ )
Unit: microseconds
  expr     min       lq     mean    median       uq      max neval
  base 713.564 1097.534 2025.377 1467.4980 2996.136 4168.352   100
   rio 718.141 1156.998 2243.143 2011.5310 3106.479 7368.046   100
 readr 366.306  594.629 1265.297  734.0445 1793.405 5852.142   100
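Closing the loop on your case, a sketch that writes only the filtered subset (filtro.csv is the output name from your question; filtro is the hypothetical helper object from item 2):

> # Assumption: filtro holds only the UF_ESC == "SP" rows, so it is
> # much smaller than the original 5GB file and should write quickly.
> readr::write_csv(filtro, "filtro.csv")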
Anyway, if you were able to read the 5GB file, it is very likely that you will also be able to write it, since it is already in your computer's RAM.