Filter 5GB CSV file in R

Question:

I'm trying everything I can to work with a 5GB file for my monograph.

The code I'm trying to run in R is as follows:

> write.csv(subset(read.csv("enem.csv", header=TRUE), UF_ESC=="SP"), "filtro.csv", row.names=FALSE)

I've already tested it with a sample file and it worked, but not with the original database. I imagine it's because of the size, since it just keeps running and nothing happens.

If anyone has any other ideas, the help would be greatly appreciated.

Answer:

Here are some tips to solve your problem:

1 – Read the file using another function:

> microbenchmark(
+   base = read.csv(file = "df-write-csv.csv", header = T),
+   readr = readr::read_csv("df-write-csv.csv"),
+   data.table = data.table::fread("df-write-csv.csv"),
+   rio = rio::import("df-write-csv.csv", format = "csv")
+ )
Unit: microseconds
       expr      min        lq      mean    median        uq      max neval
       base 1836.230 1912.1815 2253.6071 1980.3995 2282.1675 4148.787   100
      readr  823.960  881.3625 1072.4790  921.6605 1120.2365 3538.359   100
 data.table  327.759  364.4810  442.5933  402.3295  458.7895  920.436   100
        rio  312.317  351.2260  444.1087  382.9325  439.7960 2938.490   100

Note that reading the file with data.table's fread function or rio's import function is about 4x faster than with R's native function. Check whether you actually managed to read the file; a sketch applying this to the file from the question follows below.
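A minimal sketch of this step, assuming the file name ("enem.csv") and the header from the question:

# Read the 5GB file with data.table::fread instead of read.csv.
# fread returns a data.table, which also helps with the filtering step.
library(data.table)

enem <- fread("enem.csv", header = TRUE)
nrow(enem)  # quick check that the file was actually read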

2 – Check that you really managed to filter your database. Save the result of subset in a helper object. If this is where the problem is, try filtering with functions from other packages, such as dplyr or data.table.

For large operations, data.table can be much faster, as the benchmark below shows (a sketch applying the filter to the question's data comes after it).

> df <- data.frame(x = 1:100000, y = 1:100000, l = sample(letters, size = 100, replace = T))
> microbenchmark(
+   base = subset(df, l == "a"),
+   dplyr = dplyr::filter(df, l == "a"),
+   data.table = data.table(df)[l == "a",]
+ )
Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval
       base 10.329514 12.467143 14.962479 13.976907 17.171858  24.3761   100
      dplyr  7.331626  8.624356 10.063947  8.853807 11.140871  16.8939   100
 data.table  2.986519  4.580536  6.774548  4.824227  5.957255 119.9709   100
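A minimal sketch of the filtering step, assuming the data was read into an object called enem (as in the sketch above) and that UF_ESC is the column from the question:

library(dplyr)
library(data.table)

# dplyr: save the result in a helper object so you can inspect it
filtro <- dplyr::filter(enem, UF_ESC == "SP")
nrow(filtro)  # check that the filter actually returned rows

# data.table: if enem came from fread, it is already a data.table
filtro_dt <- enem[UF_ESC == "SP"]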

3 – Use the write_csv function from the readr package; it is roughly 2x faster than R's native write.csv function (see the sketch after the benchmark below).

> microbenchmark(
+   base = write.csv(df, file = "df-write-csv.csv", row.names = F),
+   rio = rio::export(df, file = "df-rio.csv", format = "csv"),
+   readr = readr::write_csv(df, path = "df-readr.csv")
+ )

Unit: microseconds
  expr     min       lq     mean    median       uq      max neval
  base 713.564 1097.534 2025.377 1467.4980 2996.136 4168.352   100
   rio 718.141 1156.998 2243.143 2011.5310 3106.479 7368.046   100
 readr 366.306  594.629 1265.297  734.0445 1793.405 5852.142   100
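A minimal sketch of the writing step, assuming filtro holds the filtered subset from the previous step (write_csv never writes row names, so no row.names argument is needed):

library(readr)

write_csv(filtro, "filtro.csv")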

Anyway, if you were able to read the 5GB file, it is very likely that you will also be able to write it, since it is already in your computer's RAM.
