Fast & Big Data in R

A fast tour of the R toolkit for development data work: serializing objects and data frames (RDS, qs2, fst), lightning-fast in-memory computation with collapse, and working with data larger than RAM using Parquet/Arrow and DuckDB.
R, performance, data, collapse, arrow, duckdb, serialization
Author

Eduard Bukin

Circle 1. Falling into the Floating

Once we had crossed the Acheron, we arrived in the first Circle, home of the virtuous pagans. These are people who live in ignorance of the Floating Point Gods. These pagans expect

.1 == .3 / 3

to be true. The virtuous pagans will also expect

seq(0, 1, by=.1) == .3

to have exactly one value that is true. But you should not expect something like:

unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))

to have length one.

— Burns, P. (2011). The R Inferno (§ 1, p. 9). burns-stat.com.

.1 == .3 / 3
[1] FALSE
seq(0, 1, by=.1) == .3
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))
[1] 0.3 0.3 0.3
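
The usual way around these surprises is to compare doubles with a tolerance instead of `==`. The sketch below uses base R's `all.equal()` wrapped in `isTRUE()`. The helper `near_eq()` and its default tolerance are illustrative assumptions, not part of the quoted text.

```r
# Exact comparison fails because 0.3 / 3 has no exact binary representation.
0.1 == 0.3 / 3                      # FALSE

# all.equal() compares with a tolerance (about 1.5e-8 by default);
# wrap it in isTRUE() because it returns a message, not FALSE, on mismatch.
isTRUE(all.equal(0.1, 0.3 / 3))     # TRUE

# Hypothetical vectorized helper: TRUE where |x - y| is below `tol`.
near_eq <- function(x, y, tol = 1e-8) abs(x - y) < tol

s <- seq(0, 1, by = 0.1)
sum(near_eq(s, 0.3))                # 1: exactly one element is near 0.3

# Rounding before unique() collapses values that differ only by rounding error.
x <- c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4)
length(unique(round(x, 10)))        # 1
```

The right tolerance depends on the scale of the data, so `1e-8` here is only a placeholder.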

Introduction

R is bounded by three fundamental constraints:

  1. Floating-point arithmetic: .1 == .3 / 3 is FALSE
  2. Random Access Memory (RAM): by default, R loads the entire dataset into memory before operating on it
  3. Computation speed: on a multi-core machine, single-threaded code leaves most of the CPU idle

This session addresses the last two. The modern R toolkit goes a long way toward solving both: newer binary formats read and write data faster, and produce smaller files, than base R's defaults; out-of-memory engines (Arrow, DuckDB) query files larger than RAM without loading them in full; and packages like collapse and data.table make much fuller use of modern multi-core hardware for in-memory work.

The session is split into two parts:

Part 1 — Serialization

Getting data in and out of R fast.

When you need to cache results between sessions, share large files, or store data too large to keep in RAM, the choice of format determines read and write speed, file size, and which other tools can open the file. Part 1 covers the full spectrum, from base R’s saveRDS() to columnar, compressed, cross-language formats, with benchmarks on datasets from 1k to 1M rows and a recommendations table to guide format selection.

Topics covered:

  • RDS and qs2 — binary R-native formats that preserve any R object exactly
  • fst — columnar binary with random column/row access
  • fwrite / fread (data.table) — fastest CSV for sharing with non-R users
  • Feather and Parquet (Arrow) — columnar, cross-language, cloud-native
  • DuckDB — embedded SQL database, reads Parquet, works larger-than-RAM
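
As a rough illustration of the formats listed above, the sketch below round-trips a small data frame through RDS in base R and then, only if the packages happen to be installed, through qs2, fst, and Parquet. The example data and file paths are made up for illustration, and the optional blocks are guarded with `requireNamespace()` so the script still runs without those packages.

```r
# Small illustrative data frame (not from the benchmarks in Part 1).
df <- data.frame(
  id    = 1:5,
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  group = c("a", "b", "a", "b", "a")
)

# Base R: RDS stores any R object and restores it exactly.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
df_rds <- readRDS(rds_path)
identical(df, df_rds)               # TRUE

# Optional: qs2 is also R-native and aims at faster, smaller files.
if (requireNamespace("qs2", quietly = TRUE)) {
  qs_path <- tempfile(fileext = ".qs2")
  qs2::qs_save(df, qs_path)
  print(qs2::qs_read(qs_path))
}

# Optional: fst can read selected columns and a row range from disk.
if (requireNamespace("fst", quietly = TRUE)) {
  fst_path <- tempfile(fileext = ".fst")
  fst::write_fst(df, fst_path)
  print(fst::read_fst(fst_path, columns = c("id", "value"), from = 2, to = 4))
}

# Optional: Parquet via arrow is columnar and readable from other languages.
if (requireNamespace("arrow", quietly = TRUE)) {
  pq_path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(df, pq_path)
  print(arrow::read_parquet(pq_path, col_select = c("id", "group")))
}
```

Only the RDS and qs2 round trips are meant to preserve arbitrary R objects; the columnar formats store data frames and may not keep every R-specific attribute.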

→ Read Part 1: Data Serialization in R

Part 2 — Data Wrangling

Computing on data fast — in RAM and beyond.

Once data is in memory, the choice of computation tool determines how fast the analysis runs. More importantly, some operations can stream over data without ever loading all of it into RAM; others must materialize entire columns or partitions. Part 2 explains the distinction and benchmarks the tools against each other.

Topics covered:

  • dplyr as the readability baseline
  • data.table and collapse for maximum in-memory speed
  • Arrow and DuckDB for filtering, aggregation, and joins larger-than-RAM
  • Which operations can stream out-of-memory (filter, sum, join) vs. which must materialize in RAM (exact median, window functions, global sort)
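
To make the out-of-memory idea concrete, here is a minimal sketch of a filter plus grouped sum run by DuckDB directly against a Parquet file, with a base R computation as the reference result. The table name `demo`, the column names, and the tiny data set are assumptions for illustration; in practice the Parquet file would be far larger than RAM. The DuckDB part runs only if the `duckdb` and `DBI` packages are installed.

```r
# Illustrative data: two groups, values 1 to 10.
df <- data.frame(grp = rep(c("a", "b"), each = 5), value = 1:10)

# In-memory reference with base R: filter, then sum per group.
keep <- df$value > 2
expected <- tapply(df$value[keep], df$grp[keep], sum)
expected                            # a = 12, b = 40

if (requireNamespace("duckdb", quietly = TRUE) &&
    requireNamespace("DBI", quietly = TRUE)) {
  con <- DBI::dbConnect(duckdb::duckdb())

  # Write the demo data to a Parquet file using DuckDB itself.
  pq_path <- normalizePath(tempfile(fileext = ".parquet"),
                           winslash = "/", mustWork = FALSE)
  DBI::dbWriteTable(con, "demo", df)
  DBI::dbExecute(con, sprintf("COPY demo TO '%s' (FORMAT PARQUET)", pq_path))

  # Query the file in place: only the aggregated result comes back
  # to R as a data frame, not the full table.
  res <- DBI::dbGetQuery(con, sprintf(
    "SELECT grp, SUM(value) AS total
       FROM read_parquet('%s')
      WHERE value > 2
      GROUP BY grp
      ORDER BY grp", pq_path))
  print(res)                        # should match `expected`

  DBI::dbDisconnect(con, shutdown = TRUE)
}
```

A filter and a sum need only a running total per group, which is why they can stream; an exact median would require the engine to keep or sort all values of each group.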

→ Read Part 2: Fast Data Wrangling in R