Circle 1. Falling into the Floating Point Trap
Once we had crossed the Acheron, we arrived in the first Circle, home of the virtuous pagans. These are people who live in ignorance of the Floating Point Gods. These pagans expect
.1 == .3 / 3
to be true. The virtuous pagans will also expect
seq(0, 1, by=.1) == .3
to have exactly one value that is true. But you should not expect something like:
unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))
to have length one.
— Burns, P. (2011). The R Inferno (§ 1, p. 9). burns-stat.com.
.1 == .3 / 3
[1] FALSE
seq(0, 1, by=.1) == .3
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))
[1] 0.3 0.3 0.3
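The usual way around these surprises is to compare doubles within a tolerance rather than with ==. The following is a minimal sketch in base R; the helper name near_equal and its default tolerance are illustrative assumptions, not something taken from The R Inferno.

```r
# Compare doubles within a tolerance instead of testing exact equality.
# `near_equal` is a hypothetical helper; the default tolerance is an assumption.
near_equal <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

x <- seq(0, 1, by = .1)

exact_hits <- sum(x == .3)            # exact comparison
close_hits <- sum(near_equal(x, .3))  # tolerance comparison

# all.equal() returns TRUE or a description of the difference,
# so wrap it in isTRUE() when a single logical is needed.
same_value <- isTRUE(all.equal(.1, .3 / 3))

# Rounding before unique() collapses values that differ only by rounding error.
vals <- c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4)
n_unique <- length(unique(round(vals, 10)))
```

With the tolerance in place, the sequence contains exactly one value matching .3, and the five differences collapse to a single unique value.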
Introduction
R is bounded by three fundamental constraints:
- Floating point arithmetic: .1 == .3 / 3 is FALSE
- Random Access Memory (RAM): by default, R operations require the full dataset in memory
- Computation speed: single-threaded code leaves most cores of a multi-core CPU idle
This session addresses the last two. The modern R toolkit largely solves both: newer binary formats serialize data faster and into smaller files than base R's defaults, larger-than-memory engines (Arrow, DuckDB) process files that exceed RAM without loading them in full, and packages like collapse and data.table make far better use of modern multi-core hardware for in-memory work.
The session is split into two parts:
Part 1 — Serialization
Getting data in and out of R fast.
When you need to cache results between sessions, share large files, or store data too large to keep in RAM, the choice of format matters enormously. Part 1 covers the full spectrum — from base R’s saveRDS() to columnar, compressed, cross-language formats — with benchmarks across 1k to 1M rows and a recommendations table to guide format selection.
Topics covered:
- RDS and qs2 — binary R-native formats that preserve any R object exactly
- fst — columnar binary with random column/row access
- fwrite / fread (data.table) — fastest CSV for sharing with non-R users
- Feather and Parquet (Arrow) — columnar, cross-language, cloud-native
- DuckDB — embedded SQL database, reads Parquet, works larger-than-RAM
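As a rough illustration of the formats listed above, the sketch below round-trips a small data frame through each of them. Only base R is required; every other format is guarded by requireNamespace() and runs only if its package happens to be installed. The data, the scratch directory, and the helper fmt_path are hypothetical, and the sketch does not stand in for the session's benchmarks.

```r
# Round-trip a small data frame through several formats.
set.seed(1)
df <- data.frame(id = 1:1000, x = rnorm(1000), g = sample(letters, 1000, TRUE))

out_dir <- tempfile("fmt_")  # hypothetical scratch directory
dir.create(out_dir)
fmt_path <- function(ext) file.path(out_dir, paste0("df.", ext))

# Base R: RDS preserves the R object exactly.
saveRDS(df, fmt_path("rds"))
df_rds <- readRDS(fmt_path("rds"))

if (requireNamespace("qs2", quietly = TRUE)) {
  qs2::qs_save(df, fmt_path("qs2"))
  df_qs2 <- qs2::qs_read(fmt_path("qs2"))
}

if (requireNamespace("fst", quietly = TRUE)) {
  fst::write_fst(df, fmt_path("fst"))
  # Random access: read two columns and the first 100 rows only.
  df_fst <- fst::read_fst(fmt_path("fst"), columns = c("id", "x"),
                          from = 1, to = 100)
}

if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::fwrite(df, fmt_path("csv"))
  df_csv <- data.table::fread(fmt_path("csv"))
}

if (requireNamespace("arrow", quietly = TRUE)) {
  arrow::write_parquet(df, fmt_path("parquet"))
  # Columnar format: select columns at read time.
  df_parquet <- arrow::read_parquet(fmt_path("parquet"),
                                    col_select = c("id", "x"))
}

# File size on disk per format written.
files <- list.files(out_dir, full.names = TRUE)
sizes <- setNames(file.size(files), basename(files))
```

Inspecting sizes gives a first impression of how the formats compare on disk. Timings on data this small are not meaningful.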
Part 2 — Data Wrangling
Computing on data fast — in RAM and beyond.
Once data is in memory, the choice of computation tool determines how fast the analysis runs. More importantly, some operations can stream over data without ever loading all of it into RAM; others must materialize entire columns or partitions. Part 2 explains the distinction and benchmarks the tools against each other.
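To make the idea of computing on a file without loading it concrete, here is a hedged sketch that aggregates a Parquet file through DuckDB and pulls only the result into R. It assumes the DBI and duckdb packages are installed and is skipped otherwise. The sales data, column names, and temporary file are invented for illustration.

```r
# Aggregate a Parquet file with DuckDB; only the small result reaches R.
set.seed(7)
sales <- data.frame(region = sample(c("north", "south"), 1000, TRUE),
                    amount = runif(1000))

# In-memory reference result with base R.
expected <- aggregate(amount ~ region, data = sales, FUN = sum)

if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("duckdb", quietly = TRUE)) {
  con <- DBI::dbConnect(duckdb::duckdb())
  pq <- tempfile(fileext = ".parquet")  # hypothetical demo file

  # Write the demo data to Parquet through DuckDB itself.
  duckdb::duckdb_register(con, "sales_df", sales)
  DBI::dbExecute(con, sprintf(
    "COPY (SELECT * FROM sales_df) TO '%s' (FORMAT PARQUET)", pq))

  # Aggregate directly against the file on disk.
  result <- DBI::dbGetQuery(con, sprintf(
    "SELECT region, SUM(amount) AS amount
     FROM read_parquet('%s')
     GROUP BY region
     ORDER BY region", pq))

  DBI::dbDisconnect(con, shutdown = TRUE)
}
```

On real data the Parquet file would already exist and could be much larger than RAM. The query itself would look the same.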
Topics covered:
- dplyr as the readability baseline
- data.table and collapse for maximum in-memory speed
- Arrow and DuckDB for filtering, aggregation, and joins larger-than-RAM
- Which operations can stream out-of-memory (filter, sum, join) vs. which must materialize in RAM (exact median, window functions, global sort)
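The difference between streaming and materializing can be sketched in base R. A sum (and therefore a mean) only needs a running total per chunk, whereas an exact median needs all values at once. The chunk size below is an arbitrary stand-in for what fits in RAM, and the in-memory vector stands in for a file read piece by piece.

```r
# A sum can be accumulated chunk by chunk; an exact median cannot.
set.seed(42)
x <- rnorm(1e5)

chunk_size <- 1e4  # assumed chunk size, standing in for "what fits in RAM"
chunks <- split(x, ceiling(seq_along(x) / chunk_size))

# Streaming: keep only a running total and a count, never the full vector.
total <- 0
n <- 0
for (chunk in chunks) {
  total <- total + sum(chunk)
  n <- n + length(chunk)
}
stream_mean <- total / n

# Not streamable this way: the median of per-chunk medians is in general
# only an approximation of the exact median of the full data.
median_of_medians <- median(vapply(chunks, median, numeric(1)))
exact_median <- median(x)
```

Comparing stream_mean with mean(x) shows agreement up to floating-point error, while median_of_medians and exact_median will typically differ.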