Fast & Big Data in R

Circle 1. Falling into the Floating

Once we had crossed the Acheron, we arrived in the first Circle, home of the virtuous pagans. These are people who live in ignorance of the Floating Point Gods. These pagans expect

.1 == .3 / 3

to be true. The virtuous pagans will also expect

seq(0, 1, by=.1) == .3

to have exactly one value that is true. But you should not expect something like:

unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))

to have length one.

— Burns, P. (2011). The R Inferno (§ 1, p. 9). burns-stat.com.

Running these three expressions in R gives:

.1 == .3 / 3

[1] FALSE

seq(0, 1, by=.1) == .3

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))

[1] 0.3 0.3 0.3
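A common way around these traps is to compare with a tolerance instead of `==`. The sketch below uses only base R. The tolerance `tol` and the rounding precision of 10 digits are illustrative assumptions, not values taken from The R Inferno.

```r
# Exact comparison fails because 0.1 and 0.3 / 3 differ in their last bits.
x <- 0.1
y <- 0.3 / 3
exact_equal <- x == y                       # FALSE
tolerant_equal <- isTRUE(all.equal(x, y))   # TRUE: equal within all.equal()'s tolerance

# Element-wise comparison against 0.3 with an explicit tolerance (assumed value).
tol <- sqrt(.Machine$double.eps)
s <- seq(0, 1, by = 0.1)
n_match <- sum(abs(s - 0.3) < tol)          # 1

# Round before deduplicating so near-identical values collapse together.
vals <- c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4)
n_unique <- length(unique(round(vals, 10)))
```

`all.equal()` returns a description of the difference rather than `FALSE` when values differ, which is why it is wrapped in `isTRUE()` here.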

Introduction

R is bounded by three fundamental constraints:

1. Floating point arithmetic: .1 == .3 / 3 is FALSE
2. Random Access Memory (RAM): by default, R loads the full dataset into memory before operating on it
3. Computation speed: single-threaded code leaves most of a multi-core CPU idle

This session addresses the last two. The modern R toolkit largely solves both: newer binary formats read and write data faster and produce smaller files, out-of-memory engines (Arrow, DuckDB) process files larger than RAM without loading them fully, and packages like collapse and data.table make use of modern multi-core hardware for in-memory work.
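To get a rough feel for the memory and CPU constraints on a given machine, a minimal base R sketch follows. The data frame `df` and its size are made up for illustration, and `parallel::detectCores()` may return `NA` on some platforms.

```r
# Memory: a data frame lives entirely in RAM, so its footprint grows with its rows.
n <- 1e6
df <- data.frame(x = rnorm(n), y = rnorm(n))   # two double columns, 8 bytes per value
size_bytes <- as.numeric(object.size(df))
format(object.size(df), units = "MB")          # human-readable size

# CPU: number of logical cores that multi-threaded packages could use.
n_cores <- parallel::detectCores()
```

Two double columns of one million rows already take about 16 MB, so the same layout at one billion rows would need roughly 16 GB before any computation starts.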

The session is split into two parts:

Part 1 — Serialization

Getting data in and out of R fast.

When you need to cache results between sessions, share large files, or store data too large to keep in RAM, the choice of format matters enormously. Part 1 covers the full spectrum — from base R’s saveRDS() to columnar, compressed, cross-language formats — with benchmarks across 1k to 1M rows and a recommendations table to guide format selection.

Topics covered:

RDS and qs2 — binary R-native formats that preserve any R object exactly
fst — columnar binary with random column/row access
fwrite / fread (data.table) — fastest CSV for sharing with non-R users
Feather and Parquet (Arrow) — columnar, cross-language, cloud-native
DuckDB — embedded SQL database, reads Parquet, works larger-than-RAM
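As a small illustration of the formats listed above, the sketch below round-trips a toy data frame through base R's RDS format and, if the arrow package happens to be installed, through Parquet. The data frame and the temporary file paths are hypothetical. The benchmarks in Part 1 are not reproduced here.

```r
# Toy data frame used only for this illustration.
df <- data.frame(
  id    = 1:5,
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  label = letters[1:5]
)

# RDS: R-native binary format from base R; restores the object as it was saved.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
df_back <- readRDS(rds_path)
rds_identical <- identical(df, df_back)

# Parquet: columnar, cross-language format; runs only if arrow is available.
if (requireNamespace("arrow", quietly = TRUE)) {
  pq_path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(df, pq_path)
  df_pq <- arrow::read_parquet(pq_path)
}
```

The same pattern applies to the other packages in the list, each with its own write and read functions.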

→ Read Part 1: Data Serialization in R

Part 2 — Data Wrangling

Computing on data fast — in RAM and beyond.

Once data is in memory, the choice of computation tool determines how fast the analysis runs. More importantly, some operations can stream over data without ever loading all of it into RAM; others must materialize entire columns or partitions. Part 2 explains the distinction and benchmarks the tools against each other.
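The difference between streaming and materializing can be sketched in base R by processing a vector in chunks. This is only a toy model of what out-of-memory engines do internally; the chunk size and the simulated data are assumptions for illustration.

```r
set.seed(1)
x <- rnorm(1e5)

# Pretend the data arrives in chunks of 10,000 values that never coexist in RAM.
chunks <- split(x, ceiling(seq_along(x) / 1e4))

# A sum can stream: only the running total has to be kept between chunks.
running_sum <- 0
for (chunk in chunks) {
  running_sum <- running_sum + sum(chunk)
}
sum_matches <- isTRUE(all.equal(running_sum, sum(x)))

# An exact median cannot be assembled from per-chunk medians;
# combining them gives only an approximation of median(x).
chunk_medians <- vapply(chunks, median, numeric(1))
approx_median <- median(chunk_medians)
exact_median  <- median(x)   # needs all values at once
```

The running sum needs constant memory regardless of how many chunks there are, while the exact median needs every value available together.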

Topics covered:

dplyr as the readability baseline
data.table and collapse for maximum in-memory speed
Arrow and DuckDB for filtering, aggregation, and joins larger-than-RAM
Which operations can stream out-of-memory (filter, sum, join) vs. which must materialize in RAM (exact median, window functions, global sort)
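For orientation, here is the same grouped mean written with base R and, where the packages are installed, with data.table and collapse. The toy data and column names `g` and `x` are hypothetical, and this sketch says nothing about relative speed; that is what the benchmarks in Part 2 are for.

```r
set.seed(42)
df <- data.frame(
  g = sample(letters[1:3], 1000, replace = TRUE),
  x = runif(1000)
)

# Base R: mean of x within each group g.
base_means <- tapply(df$x, df$g, mean)

# data.table: same aggregation, runs only if the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  dt_means <- dt[, list(mean_x = mean(x)), by = g]
}

# collapse: grouped mean via fmean(), runs only if the package is available.
if (requireNamespace("collapse", quietly = TRUE)) {
  collapse_means <- collapse::fmean(df$x, g = df$g)
}
```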

→ Read Part 2: Fast Data Wrangling in R

---
title: "Fast & Big Data in R"
title-long: "Fast & Big Data in R: Serialization, collapse, and Out-of-Memory Computing"
date: "2026-06-DD" # TODO: confirm seminar date
description: "A fast tour of the R toolkit for development data work: serializing objects and data frames (RDS, qs2, fst), lightning-fast in-memory computation with collapse, and working with data larger than RAM using Parquet/Arrow and DuckDB."
author: "Eduard Bukin" # TODO: confirm speaker
categories: [R, performance, data, collapse, arrow, duckdb, serialization]
# image: "images/cover.png"
code-tools: true
code-fold: false
---