When the Database Brings Its Own Forklift
A systems-level look at why DuckDB outperformed Polars on a 1TB workload and what it taught me about execution engines, work avoidance, and building for scale.

While reading a benchmark showing DuckDB outperforming Polars on a 1TB dataset, I had a familiar systems déjà vu: this wasn’t really about SQL versus DataFrames. It was about where work actually happens once scale stops being theoretical and starts being painful.
I’ve seen this pattern repeat across domains: distributed storage, query engines, and even RAG pipelines. The systems that win at scale aren’t the ones that execute faster loops. They’re the ones that avoid doing unnecessary work altogether.
This post is my attempt to explain why DuckDB pulled ahead in that benchmark, not as a performance flex but as a systems lesson that keeps showing up across the stack.
The Benchmark Is a Red Herring (But a Useful One)
At a surface level, the result feels provocative:
DuckDB processes 1TB-scale analytics queries faster than Polars.
Benchmarks like this are tempting to read as verdicts. But they rarely explain why a system wins; they only show where it happened to win.
The more interesting question isn’t which tool is faster. It’s what architectural assumptions start to dominate once data stops fitting in memory.
That’s where this benchmark becomes useful.
When 1TB Changes the Rules
At smaller scales, Polars feels unbeatable:
- Columnar in-memory layout
- Lazy execution
- SIMD-friendly operators
As long as your data fits comfortably in RAM, these choices shine.
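For a sense of why, here is a minimal sketch of the kind of workload where Polars shines (file and column names are hypothetical): a lazy plan over data that fits in RAM, collected into vectorized, SIMD-friendly kernels.

```python
import polars as pl

# Hypothetical file and columns, small enough to fit comfortably in RAM.
result = (
    pl.scan_parquet("events.parquet")        # lazy: builds a plan, reads nothing yet
      .filter(pl.col("status") == "error")   # becomes part of the optimized plan
      .group_by("service")
      .agg(pl.col("latency_ms").mean().alias("avg_latency_ms"))
      .collect()                             # the whole pipeline executes in memory
)
```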
But 1TB changes the game entirely.
At that scale:
- Memory becomes a constrained, contested resource
- Cache locality matters more than raw CPU speed
- Disk access patterns dominate overall performance
This is where DuckDB quietly switches modes.
A Mental Model That Helped Me
I think of the difference like this:
- Polars is a sports car: incredibly fast on clean roads.
- DuckDB is a forklift: slower top speed, but built to move heavy pallets all day without collapsing.
A 1TB workload isn’t a racetrack. It’s a warehouse.
DuckDB’s Real Advantage: Doing Work Before Data Moves
DuckDB isn’t just executing SQL. It’s aggressively rewriting your intent into the cheapest possible plan.
Several design choices compound here.
Vectorized Execution With Discipline
Both DuckDB and Polars use vectorized execution. The difference is how tightly DuckDB couples vectorization with:
- Predicate pushdown
- Late materialization
- Operator fusion
In practice, this means rows that will be filtered out often never get fully loaded. Entire chunks of data are skipped before they become expensive.
In a DataFrame model, even a lazy one, once data is materialized you’ve already paid a significant I/O and memory cost.
DuckDB avoids that bill.
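Here is a minimal sketch of that pattern using DuckDB’s Python API over a hypothetical wide Parquet dataset (file and column names are made up for illustration):

```python
import duckdb

con = duckdb.connect()

# Hypothetical wide Parquet dataset; only `country` and `revenue` are
# referenced, so the other columns are never read. The WHERE clause is
# pushed into the Parquet scan, and row groups whose min/max statistics
# exclude 'DE' are skipped before any rows are materialized.
result = con.execute("""
    SELECT sum(revenue) AS total_revenue
    FROM read_parquet('sales/*.parquet')
    WHERE country = 'DE'
""").fetchone()
```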
A Concrete Example From My DLQS Project
This benchmark felt familiar because I ran into the same tradeoffs while building a distributed log storage and query system.
The logs lived on disk, not in memory. Typical queries looked like:
- “Count events by type over the last N hours”
- “Aggregate latency percentiles across millions of records”
When I embedded DuckDB for analytics, the surprising part wasn’t raw speed; it was how little data it touched.
Filters on timestamps and event types were pushed down so aggressively that:
- Large portions of log segments were skipped entirely
- Aggregations streamed through vectorized batches
- Memory usage stayed flat even as data volume grew
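The latency query, roughly sketched in the SQL I embedded (segment paths and column names are illustrative, not the real DLQS schema):

```python
import duckdb

con = duckdb.connect()

# Illustrative layout: one Parquet file per log segment. DuckDB prunes
# segments and row groups via the timestamp filter, then streams the
# surviving batches through the percentile aggregation.
latency = con.execute("""
    SELECT
        event_type,
        quantile_cont(latency_ms, [0.50, 0.95, 0.99]) AS p50_p95_p99
    FROM read_parquet('segments/*.parquet')
    WHERE ts >= now() - INTERVAL '6 hours'
    GROUP BY event_type
""").fetch_df()
```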
At that point, DuckDB stopped feeling like “a SQL engine” and started feeling like part of the storage layer itself.
That same instinct shows up clearly in the 1TB benchmark.
What I Got Wrong Initially
Early on, I underestimated how much execution planning matters once data stops fitting in memory.
My initial assumption, shaped by smaller-scale experiments, was:
As long as execution is vectorized and lazy, the rest is just plumbing.
That turned out to be wrong.
Lazy execution helps delay work.
Query planning decides whether that work happens at all.
At scale, avoiding work beats doing work fast every single time.
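One habit that came out of this: inspect the plan instead of assuming the execution model will save you. A quick sketch with DuckDB (file and column names are hypothetical):

```python
import duckdb

con = duckdb.connect()

# Hypothetical file and columns. EXPLAIN prints the physical plan; when the
# predicate can be pushed down, it shows up inside the Parquet scan operator
# itself rather than as a separate filter over already-materialized rows.
for _, plan_text in con.execute("""
    EXPLAIN
    SELECT count(*)
    FROM read_parquet('events/*.parquet')
    WHERE event_type = 'error'
""").fetchall():
    print(plan_text)
```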
The Counterfactual: Forcing Polars Into This World
If I had tried to force Polars into my DLQS analytics path or into this 1TB workload, this is where it would have cracked:
- Large joins would still require materializing substantial intermediate state
- Disk-backed data would need to be staged into memory-resident frames
- Memory pressure would show up as performance cliffs instead of graceful degradation
Polars would still be fast, but fast at doing work that, architecturally, shouldn’t exist.
That’s not a flaw.
It’s a signal that the workload has crossed an abstraction boundary.
A Distributed Systems Parallel
This benchmark reminded me of a lesson I learned the hard way in distributed systems:
Throughput doesn’t come from faster workers. It comes from fewer messages.
DuckDB wins here because it:
- Pushes filters down early
- Avoids materializing junk
- Executes operators in cache-friendly batches
- Minimizes memory churn
That’s the same principle behind:
- Log-structured storage
- Query planners in distributed databases
- Retrieval pruning in RAG systems
Different domains. Same law.
Abstractions Leak at Scale
At small scales:
- “SQL vs DataFrames” feels like an API preference
- Performance differences hide behind hardware
At large scales:
- Abstractions leak
- Execution models matter
- Systems thinking beats syntax
DuckDB’s abstraction leaks less at 1TB because it was built assuming:
- Disk exists
- Memory is finite
- Users will ask hard questions of large data
Conclusion
This wasn’t really about DuckDB versus Polars.
It was about:
- Execution engines versus data structures
- Planned work versus eager work
- Systems designed for scale versus systems designed for speed
When your data is small, choose joy.
When your data is large, choose discipline.
DuckDB brought a forklift to a warehouse problem, and that made all the difference.