Part 1 — Data Serialization in R

How R reads and writes data — RDS, qs2, fst, CSV, Feather, Parquet, and DuckDB — with benchmarks across 1k to 1M rows and practical recommendations for development-data workflows.
R
serialization
qs2
fst
arrow
parquet
duckdb
benchmarks
Author

Eduard Bukin

What is Serialization?

Do not expect to write data to a file (such as with write.table), read the data back into R and have that be precisely the same as the original. That is doing two translations, and there is often something lost in translation.

— Burns, P. (2011). The R Inferno (§ 8.3.10, p 105). burns-stat.com.

Serialization converts an R object into a byte stream for storage or transmission, then reconstructs it exactly. Unlike text formats (CSV), which convert data to characters and discard type information, binary serialization preserves the complete object: column types, factor levels, class attributes, and custom metadata.
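The difference between the two translations is easy to see with base R alone. The sketch below uses a made-up four-column data frame (the column names and values are illustrative, not from the benchmarks) and round-trips it once through CSV and once through RDS.

```r
# Illustrative data: an integer, an ordered-by-hand factor, a Date, and a double
df <- data.frame(
  id    = 1:3,
  grade = factor(c("low", "high", "medium"), levels = c("low", "medium", "high")),
  day   = as.Date("2026-01-01") + 0:2,
  ratio = c(1 / 3, 2 / 3, 1 / 7)
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

# CSV: the factor and the Date come back as plain character columns
sapply(from_csv, class)

# CSV: doubles are written with at most 15 significant digits,
# so the values read back are close to, but not equal to, the originals
any(from_csv$ratio != df$ratio)

# RDS: the object is reconstructed exactly
identical(df, from_rds)
```

The CSV copy can be repaired by hand (re-declaring factor levels, parsing dates), but that repair code is exactly the post-processing that binary serialization avoids.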

R’s Serialization Toolkit

Seven formats cover most use cases, grouped below by storage strategy. CSV appears twice because both base R and data.table can write it.

Binary R-native

Store any R object as bytes; reconstruct exactly with no post-processing.

| Package | Write / Read | Notes |
| --- | --- | --- |
| base R | saveRDS() / readRDS() | Any R object; gzip/bzip2/xz compression; single-threaded |
| qs2 | qs_save() / qs_read() | Any R object; zstd/lz4; multi-threaded; writes about 5–10× faster than saveRDS() |
| fst | write_fst() / read_fst() | Data frames only; LZ4/ZSTD; multi-threaded; random column/row access |

Text-based

Convert data to human-readable text; type information is lost and must be inferred on read.

| Package | Write / Read | Notes |
| --- | --- | --- |
| base R | write.csv() / read.csv() | Portable; slow; single-threaded |
| data.table | fwrite() / fread() | Same format; 10–100× faster; smart type inference |

Cross-platform binary

Use C++ libraries (Apache Arrow, DuckDB) to access R memory directly — zero-copy where possible, language-agnostic output.

| Package | Write / Read | Notes |
| --- | --- | --- |
| arrow | write_feather() / read_feather() | Arrow columnar format; fastest reads in the benchmarks below; largest files among the binary formats |
| arrow | write_parquet() / read_parquet() | Columnar with compression; column/row selection; cloud-native |
| duckdb | dbWriteTable() / dbReadTable() | Embedded SQL database; larger-than-RAM support |

Native R Serialization in Practice

Basic usage

df <- data.frame(
  id    = 1:3,
  name  = c("Alice", "Bob", "Charlie"),
  score = c(95.5, 87.3, 92.1)
)
saveRDS(df, "temp_df.rds")
df_restored <- readRDS("temp_df.rds")
identical(df, df_restored)
[1] TRUE

Format and compression options

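# Format version 3 (the default since R 3.6.0) can store ALTREP objects compactly,
# e.g. 1:1000000 as a range instead of a million integers. For a tiny data frame
# the extra version 3 metadata outweighs that saving, so the v3 file is larger here.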
saveRDS(df, "temp_v2.rds", version = 2)
saveRDS(df, "temp_v3.rds", version = 3)
cat("Version 2:", file.size("temp_v2.rds"), "bytes\n")
cat("Version 3:", file.size("temp_v3.rds"), "bytes\n")

# Compression: FALSE = fastest write; "xz" = smallest file
large_df <- data.frame(x = rep(1:100, 100), y = rnorm(10000),
                       z = sample(letters, 10000, replace = TRUE))
saveRDS(large_df, "temp_none.rds",  compress = FALSE)
saveRDS(large_df, "temp_gzip.rds",  compress = "gzip")
saveRDS(large_df, "temp_bzip2.rds", compress = "bzip2")
saveRDS(large_df, "temp_xz.rds",   compress = "xz")
cat("No compression:", file.size("temp_none.rds"),  "bytes\n")
cat("gzip:          ", file.size("temp_gzip.rds"),  "bytes\n")
cat("bzip2:         ", file.size("temp_bzip2.rds"), "bytes\n")
cat("xz:            ", file.size("temp_xz.rds"),   "bytes\n")
Version 2: 169 bytes
Version 3: 217 bytes
No compression: 210203 bytes
gzip:           89287 bytes
bzip2:          85597 bytes
xz:             85008 bytes
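The three-row data frame above is too small to show what ALTREP support buys. A minimal sketch with a long integer sequence, serialized in memory so that compression does not blur the comparison, makes the effect visible; the variable names are illustrative.

```r
# 1:1000000 is held by R as a compact (ALTREP) sequence: start, length, step
x <- 1:1000000

# Serialize to a raw vector with each format version, uncompressed
v2 <- serialize(x, connection = NULL, version = 2)
v3 <- serialize(x, connection = NULL, version = 3)

length(v2)  # a little over 4,000,000 bytes: every integer is written out
length(v3)  # a few hundred bytes at most: only the compact description is stored

# The compact form still restores to the same vector
identical(unserialize(v3), x)
```

The saving applies only while the vector is still in its compact form; a sequence that has been modified element by element is stored in full under either version.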

Binary vs text encoding

raw_bytes   <- serialize(df, connection = NULL)
ascii_bytes <- serialize(df, connection = NULL, ascii = TRUE)
head(raw_bytes,   16)
head(ascii_bytes, 16)
cat("Binary:", length(raw_bytes),   "bytes\n")
cat("ASCII: ", length(ascii_bytes), "bytes\n")
 [1] 58 0a 00 00 00 03 00 04 05 03 00 03 05 00 00 00
 [1] 41 0a 33 0a 32 36 33 34 32 37 0a 31 39 37 38 38
Binary: 376 bytes
ASCII:  352 bytes
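The first bytes in the output above are a format marker: 58 is the letter "X" (XDR binary) and 41 is the letter "A" (ASCII), each followed by a newline (0a). The sketch below, using a small illustrative list rather than the data frame, reads that marker and the format version back out of the byte stream and confirms that both encodings restore the same object.

```r
# Illustrative object with integer and character content only
x <- list(a = 1:3, b = "text")

bin <- serialize(x, connection = NULL)               # XDR binary encoding
asc <- serialize(x, connection = NULL, ascii = TRUE) # ASCII encoding

# Byte 1 identifies the encoding
rawToChar(bin[1])
rawToChar(asc[1])

# In the binary stream, bytes 3 to 6 hold the format version as a big-endian integer
readBin(bin[3:6], what = "integer", endian = "big")

# Both streams reconstruct the original object
identical(unserialize(bin), x)
identical(unserialize(asc), x)
```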

Attributes and types are preserved

# Custom attributes and subclasses survive round-trips
df_special <- df
attr(df_special, "created_date") <- Sys.Date()
attr(df_special, "source") <- "example data"
class(df_special) <- c("my_special_df", "data.frame")
saveRDS(df_special, "temp_special.rds")
attributes(readRDS("temp_special.rds"))

# All column types are preserved exactly
df_types <- data.frame(
  int_col  = 1:3,
  dbl_col  = c(1.1, 2.2, 3.3),
  chr_col  = c("a", "b", "c"),
  fct_col  = factor(c("low", "high", "medium")),
  date_col = as.Date(c("2026-01-01", "2026-01-02", "2026-01-03")),
  stringsAsFactors = FALSE
)
saveRDS(df_types, "temp_types.rds")
str(df_types)
str(readRDS("temp_types.rds"))
$names
[1] "id"    "name"  "score"

$class
[1] "my_special_df" "data.frame"   

$row.names
[1] 1 2 3

$created_date
[1] "2026-06-02"

$source
[1] "example data"

'data.frame':   3 obs. of  5 variables:
 $ int_col : int  1 2 3
 $ dbl_col : num  1.1 2.2 3.3
 $ chr_col : chr  "a" "b" "c"
 $ fct_col : Factor w/ 3 levels "high","low","medium": 2 1 3
 $ date_col: Date, format: "2026-01-01" "2026-01-02" ...
'data.frame':   3 obs. of  5 variables:
 $ int_col : int  1 2 3
 $ dbl_col : num  1.1 2.2 3.3
 $ chr_col : chr  "a" "b" "c"
 $ fct_col : Factor w/ 3 levels "high","low","medium": 2 1 3
 $ date_col: Date, format: "2026-01-01" "2026-01-02" ...

Format Deep Dives

qs2 — fast serialization of any R object

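The qs2 package stores any R object (data frames, models, lists, environments) with zstd/lz4 compression and optional multi-threading. Writes are typically 5–10× faster than saveRDS(): in the benchmarks below, roughly 6× at 100k rows and 10× at 1M rows, with reads 2–3× faster.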

library(qs2)

# Threads are opt-in: nthreads falls back to qopt("nthreads"), which is 1 unless changed
qs_save(large_data, "data.qs2")                  # session default thread count
qs_save(large_data, "data.qs2", nthreads = 4)    # or specify explicitly
qopt("nthreads")                                  # check the session default

# Save a complete analysis workspace — models, predictions, fitted objects
analysis_state <- list(
  model       = trained_rf_model,
  predictions = pred_results,
  metrics     = performance_metrics,
  metadata    = list(timestamp = Sys.time(), r_version = R.version.string)
)
qs_save(analysis_state, "analysis_cache.qs2")
analysis_state <- qs_read("analysis_cache.qs2")   # restore in one call

fst — columnar binary with random access

fst stores data frames in a columnar format that supports partial reads without a full scan — useful when working with wide data frames or large files where you only need a subset of columns or a row range.

library(fst)

write_fst(df, "data.fst", compress = 50)    # 0 = fastest write; 100 = smallest file

df_full    <- read_fst("data.fst")                                         # everything
df_cols    <- read_fst("data.fst", columns = c("region", "income"))        # columns only
df_rows    <- read_fst("data.fst", from = 1000, to = 5000)                 # row range
df_partial <- read_fst("data.fst", columns = c("region", "income"),
                       from = 1000, to = 5000)                             # both

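fst compresses on multiple threads while earlier blocks are being written, so compression and disk writes overlap. Because fewer bytes reach the disk, the effective throughput (uncompressed bytes per second) can exceed the disk's raw write speed.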

Arrow Parquet — columnar analytics format

Parquet is the standard format for analytical workloads: compressed, columnar, and readable by Python, Spark, BigQuery, DuckDB, and many other tools without conversion.

library(arrow)
library(dplyr)

# Write and read
write_parquet(data_100k, "survey.parquet")
read_parquet("survey.parquet")
read_parquet("survey.parquet", col_select = c("id", "region", "income"))

# Compression options
write_parquet(data_100k, "fast.parquet",       compression = "snappy")     # faster
write_parquet(data_100k, "small.parquet",      compression = "zstd",
              compression_level = 9)                                        # smaller

# Partitioned dataset — splits into region=/year= directory tree
write_dataset(data_100k, path = "survey_parts", partitioning = c("region", "year"))
ds <- open_dataset("survey_parts")

# Filter + aggregate without loading the full file — only matching partitions are read
ds |>
  filter(region == "A", year >= 2023) |>
  group_by(country) |>
  summarise(avg_income = mean(income), n = n()) |>
  collect()

Replace the local path with an object-store URI such as "s3://bucket/prefix/" (this needs an arrow build with S3 support) and the same code reads from cloud storage; only the requested columns and row groups are downloaded.

DuckDB — embedded analytical database

DuckDB runs as an in-process library (no server), speaks SQL and dplyr, reads Parquet directly, and spills to disk automatically when data exceeds RAM.

library(duckdb)
library(dplyr)

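# Pick one connection type; opening both and reusing `con` would leak the first
con <- dbConnect(duckdb())                            # in-memory, discarded on disconnect
# con <- dbConnect(duckdb(), dbdir = "data.duckdb")   # persistent file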

# Write R data frame into DuckDB
dbWriteTable(con, "survey", data_100k)

# SQL query
dbGetQuery(con, "
  SELECT region, AVG(income) AS avg_income, COUNT(*) AS n
  FROM survey WHERE employed = TRUE
  GROUP BY region ORDER BY avg_income DESC
")

# Same query via dplyr
tbl(con, "survey") |>
  filter(employed) |>
  group_by(region) |>
  summarise(avg_income = mean(income), n = n()) |>
  arrange(desc(avg_income)) |>
  collect()

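# Query Parquet directly, with no import step.
# region and year live in the directory names (region=.../year=.../), so
# hive_partitioning exposes them as columns.
dbGetQuery(con, "
  SELECT region, year, AVG(income) AS avg_income
  FROM read_parquet('survey_parts/**/*.parquet', hive_partitioning = true)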
  WHERE year >= 2023
  GROUP BY region, year
")

dbDisconnect(con, shutdown = TRUE)

Benchmarking

library(tidyverse)
library(data.table)
library(fst)
library(qs2)
library(arrow)
library(duckdb)

set.seed(7892)

generate_test_data <- function(n_rows) {
  data.frame(
    id          = 1:n_rows,
    region      = factor(sample(LETTERS[1:5], n_rows, replace = TRUE)),
    country     = sample(c("Kenya", "Uganda", "Tanzania", "Rwanda", "Burundi",
                           "Ethiopia", "Somalia", "Sudan", "Chad", "Niger",
                           "Nigeria", "Ghana", "Senegal", "Mali", "Burkina Faso",
                           "Benin", "Togo", "Cameroon", "Congo", "DRC"),
                         n_rows, replace = TRUE),
    year        = sample(2020:2025, n_rows, replace = TRUE),
    income      = rlnorm(n_rows, meanlog = 8, sdlog = 1.5),
    employed    = sample(c(TRUE, FALSE), n_rows, replace = TRUE, prob = c(0.6, 0.4)),
    survey_date = as.Date("2020-01-01") + sample(0:2000, n_rows, replace = TRUE),
    notes       = sample(c("Complete", "Partial", "Missing data", NA, "Verified",
                           "Pending review", "Approved"), n_rows, replace = TRUE)
  )
}

data_1k   <- generate_test_data(1e3)
data_10k  <- generate_test_data(1e4)
data_100k <- generate_test_data(1e5)
data_1m   <- generate_test_data(1e6)

data.frame(
  Dataset       = c("1k rows", "10k rows", "100k rows", "1M rows"),
  `Size in RAM` = c(
    format(object.size(data_1k),   units = "MB"),
    format(object.size(data_10k),  units = "MB"),
    format(object.size(data_100k), units = "MB"),
    format(object.size(data_1m),   units = "MB")
  ),
  check.names = FALSE
) |> knitr::kable(caption = "Dataset sizes in memory")
Dataset sizes in memory
| Dataset | Size in RAM |
| --- | --- |
| 1k rows | 0 Mb |
| 10k rows | 0.5 Mb |
| 100k rows | 4.6 Mb |
| 1M rows | 45.8 Mb |

The Write/Read RAM (MB) columns report the additional R-heap memory allocated at the peak of a single operation, measured with gc(). Arrow (feather/parquet) and DuckDB manage large allocations in C++ off the R heap, so their R-side figures understate true peak usage.
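The helpers bench_write_read(), bench_duckdb(), and results_table() used below are not shown in this post. The following is a hypothetical stand-in for bench_write_read() that matches the argument order of the calls below and the measurement approach described above (timing plus gc()-based peak R-heap usage). The `dir` argument, the median over repetitions, and the returned field names are assumptions; the author's actual helper may differ.

```r
# Sum the "(Mb)" column that follows a named gc() column ("used" or "max used")
mem_mb <- function(g, col) {
  i <- which(colnames(g) == col)
  sum(g[, i + 1])
}

# Time one call and report how far the R heap peaked above its starting level
time_and_mem <- function(f) {
  base    <- mem_mb(gc(reset = TRUE), "used")   # reset peak counters, record baseline
  elapsed <- system.time(result <- f())[["elapsed"]]
  peak    <- mem_mb(gc(), "max used")
  list(ms = elapsed * 1000, mb = max(0, peak - base))
}

# Hypothetical benchmark helper: write then read, n_rep times, keep the medians
bench_write_read <- function(data, write_fn, read_fn, file_name,
                             n_rep = 3, dir = tempdir()) {
  path <- file.path(dir, file_name)
  runs <- lapply(seq_len(n_rep), function(i) {
    w <- time_and_mem(function() write_fn(data, path))
    r <- time_and_mem(function() read_fn(path))
    c(write_ms = w$ms, write_mb = w$mb, read_ms = r$ms, read_mb = r$mb)
  })
  out <- apply(do.call(rbind, runs), 2, median)
  c(out, file_mb = file.size(path) / 1024^2)
}

# Usage with a built-in dataset and base R only
res <- bench_write_read(mtcars, function(d, f) saveRDS(d, f), readRDS, "test.rds")
round(res, 2)
```

As noted above, this kind of measurement only sees the R heap, so memory allocated by C++ libraries outside it is not counted.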

bench_dir <- tempfile(); dir.create(bench_dir)

results_1m <- list(
  saveRDS    = bench_write_read(data_1m, function(d,f) saveRDS(d,f), readRDS,     "test.rds",     n_rep = 5),
  qs2        = bench_write_read(data_1m, qs_save,                    qs_read,      "test.qs2",     n_rep = 5),
  fst        = bench_write_read(data_1m, write_fst,                  read_fst,     "test.fst",     n_rep = 5),
  data.table = bench_write_read(data_1m, fwrite,                     fread,        "test.csv",     n_rep = 5),
  feather    = bench_write_read(data_1m, write_feather,              read_feather, "test.feather", n_rep = 5),
  parquet    = bench_write_read(data_1m, write_parquet,              read_parquet, "test.parquet", n_rep = 5)
)
results_1m$duckdb <- bench_duckdb(data_1m, n_rep = 5)

knitr::kable(results_table(results_1m), caption = "Performance comparison — 1M rows")
Performance comparison — 1M rows
| Method | Write (ms) | Write RAM (MB) | Read (ms) | Read RAM (MB) | File Size (MB) |
| --- | --- | --- | --- | --- | --- |
| saveRDS | 4471.9 | 0.0 | 735.8 | 41.9 | 12.26 |
| qs2 | 425.2 | 0.0 | 214.1 | 42.0 | 12.60 |
| fst | 157.3 | 0.0 | 128.4 | 45.8 | 25.91 |
| data.table | 77.6 | 0.0 | 150.7 | 53.5 | 61.39 |
| feather | 251.5 | 0.9 | 41.0 | 12.5 | 27.98 |
| parquet | 817.3 | 1.5 | 219.7 | 12.7 | 15.48 |
| duckdb | 800.8 | 16.7 | 126.4 | 46.3 | 20.76 |

unlink(bench_dir, recursive = TRUE)
bench_dir <- tempfile(); dir.create(bench_dir)

results_100k <- list(
  saveRDS    = bench_write_read(data_100k, function(d,f) saveRDS(d,f), readRDS,     "test.rds"),
  qs2        = bench_write_read(data_100k, qs_save,                    qs_read,      "test.qs2"),
  fst        = bench_write_read(data_100k, write_fst,                  read_fst,     "test.fst"),
  data.table = bench_write_read(data_100k, fwrite,                     fread,        "test.csv"),
  feather    = bench_write_read(data_100k, write_feather,              read_feather, "test.feather"),
  parquet    = bench_write_read(data_100k, write_parquet,              read_parquet, "test.parquet")
)
results_100k$duckdb <- bench_duckdb(data_100k)

knitr::kable(results_table(results_100k), caption = "Performance comparison — 100k rows")
Performance comparison — 100k rows
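| Method | Write (ms) | Write RAM (MB) | Read (ms) | Read RAM (MB) | File Size (MB) |
| --- | --- | --- | --- | --- | --- |
| saveRDS | 298.5 | 0.0 | 46.0 | 4.2 | 1.25 |
| qs2 | 51.4 | 0.1 | 19.4 | 4.2 | 1.27 |
| fst | 35.7 | 0.0 | 16.0 | 4.6 | 2.59 |
| data.table | 28.5 | 0.0 | 19.0 | 5.5 | 6.04 |
| feather | 30.3 | 0.9 | 5.7 | 2.1 | 2.80 |
| parquet | 65.2 | 1.5 | 17.0 | 2.4 | 1.86 |
| duckdb | 67.5 | 13.3 | 12.5 | 5.0 | 2.76 |
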
unlink(bench_dir, recursive = TRUE)
bench_dir <- tempfile(); dir.create(bench_dir)

results_10k <- list(
  saveRDS    = bench_write_read(data_10k, function(d,f) saveRDS(d,f), readRDS,     "test.rds"),
  qs2        = bench_write_read(data_10k, qs_save,                    qs_read,      "test.qs2"),
  fst        = bench_write_read(data_10k, write_fst,                  read_fst,     "test.fst"),
  data.table = bench_write_read(data_10k, fwrite,                     fread,        "test.csv"),
  feather    = bench_write_read(data_10k, write_feather,              read_feather, "test.feather"),
  parquet    = bench_write_read(data_10k, write_parquet,              read_parquet, "test.parquet")
)
results_10k$duckdb <- bench_duckdb(data_10k)

knitr::kable(results_table(results_10k), caption = "Performance comparison — 10k rows")
Performance comparison — 10k rows
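| Method | Write (ms) | Write RAM (MB) | Read (ms) | Read RAM (MB) | File Size (MB) |
| --- | --- | --- | --- | --- | --- |
| saveRDS | 41.0 | 0.1 | 7.4 | 0.5 | 0.13 |
| qs2 | 27.7 | 0.1 | 3.9 | 0.5 | 0.15 |
| fst | 27.5 | 0.0 | 4.8 | 0.4 | 0.26 |
| data.table | 22.2 | 0.1 | 5.3 | 0.6 | 0.59 |
| feather | 22.4 | 0.8 | 5.5 | 1.1 | 0.30 |
| parquet | 35.1 | 1.4 | 10.4 | 1.3 | 0.19 |
| duckdb | 53.7 | 13.0 | 3.8 | 0.9 | 0.01 |
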
unlink(bench_dir, recursive = TRUE)
bench_dir <- tempfile(); dir.create(bench_dir)

results_1k <- list(
  saveRDS    = bench_write_read(data_1k, function(d,f) saveRDS(d,f), readRDS,     "test.rds"),
  qs2        = bench_write_read(data_1k, qs_save,                    qs_read,      "test.qs2"),
  fst        = bench_write_read(data_1k, write_fst,                  read_fst,     "test.fst"),
  data.table = bench_write_read(data_1k, fwrite,                     fread,        "test.csv"),
  feather    = bench_write_read(data_1k, write_feather,              read_feather, "test.feather"),
  parquet    = bench_write_read(data_1k, write_parquet,              read_parquet, "test.parquet")
)
results_1k$duckdb <- bench_duckdb(data_1k)

knitr::kable(results_table(results_1k), caption = "Performance comparison — 1k rows")
Performance comparison — 1k rows
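| Method | Write (ms) | Write RAM (MB) | Read (ms) | Read RAM (MB) | File Size (MB) |
| --- | --- | --- | --- | --- | --- |
| saveRDS | 19.5 | 0.0 | 3.9 | 0.0 | 0.01 |
| qs2 | 20.6 | 0.0 | 2.2 | 0.0 | 0.02 |
| fst | 26.2 | 0.0 | 2.9 | 0.0 | 0.03 |
| data.table | 27.5 | 0.1 | 4.6 | 0.2 | 0.06 |
| feather | 20.9 | 0.8 | 5.0 | 1.0 | 0.03 |
| parquet | 24.8 | 1.4 | 5.2 | 1.2 | 0.02 |
| duckdb | 38.8 | 13.0 | 1.4 | 0.5 | 0.01 |
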
unlink(bench_dir, recursive = TRUE)
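
The 1M-row results are easier to compare as ratios against saveRDS(). The sketch below re-enters the published figures from that table by hand (the data frame and column names are illustrative) and computes how many times faster each method wrote and read.

```r
# Figures copied from the 1M-row table above
bench_1m <- data.frame(
  method   = c("saveRDS", "qs2", "fst", "data.table", "feather", "parquet", "duckdb"),
  write_ms = c(4471.9, 425.2, 157.3, 77.6, 251.5, 817.3, 800.8),
  read_ms  = c(735.8, 214.1, 128.4, 150.7, 41.0, 219.7, 126.4),
  file_mb  = c(12.26, 12.60, 25.91, 61.39, 27.98, 15.48, 20.76)
)

# Speedup relative to saveRDS (row 1): values above 1 mean faster than saveRDS
bench_1m$write_speedup <- round(bench_1m$write_ms[1] / bench_1m$write_ms, 1)
bench_1m$read_speedup  <- round(bench_1m$read_ms[1]  / bench_1m$read_ms, 1)

# Fastest readers first
bench_1m[order(bench_1m$read_ms),
         c("method", "write_speedup", "read_speedup", "file_mb")]
```

On these numbers qs2 writes about 10.5× and reads about 3.4× faster than saveRDS() at nearly the same file size, while feather reads fastest and fwrite() writes fastest, each at the cost of a larger file.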

Summary & Recommendations

Choose based on your primary use case. Rows are properties; columns are tools.

Table 1: Format comparison across key properties
| Property | saveRDS | qs2 | fst | fwrite / fread | feather | parquet | DuckDB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Write speed | slow | fast | fast | fastest | fast | moderate | moderate |
| Read speed | slow | fast | fast | fast | fastest | fast | moderate |
| File size | smallest | small | medium | largest | large | small | medium |
| Compressed | yes | yes | optional | no | optional (lz4 by default) | yes | yes |
| Multi-threaded | no | yes | yes | yes | yes | yes | yes |
| Stores any R object | ✓ | ✓ | df only | df only | df only | df only | df only |
| Select columns on read | | | ✓ | ✓ | ✓ | ✓ | ✓ |
| Select rows on read | | | range only | | | ✓ | SQL |
| Cross-language | R only | R only | R only | any text tool | Arrow | Python/Spark/etc. | any |
| Larger-than-RAM | | | | | | ✓ stream | ✓ spill |
| Best for | models, complex R objects | fast model/list caching | large df with partial reads | CSV exchange with non-R users | cross-language fast exchange | big data, cloud analytics | SQL workflows, out-of-memory |

Read More

Background reading

  • Burns, P. (2011). The R Inferno. burns-stat.com — § 8 covers numeric storage and write/read pitfalls that motivate binary serialization.