Mention "big data" and, for many people, the first things that come to mind are: distributed systems, Hadoop, data warehouses, and elaborate data lake architectures. For over a decade these terms have shaped how we think about data processing and steered the industry's infrastructure choices. Yet DuckDB puts it bluntly: "big data" has been leading us astray for a full decade.

So where did things go wrong? Why was the small-data era misread as a "big data" battleground? The official DuckDB blog raises this bold challenge, and reflects on it, in a post titled "The Lost Decade of Small Data?" (https://duckdb.org/2025/05/19/the-lost-decade-of-small-data.html).

What follows is a translation of that post, inviting you to revisit a decade spent under the illusion of data scale.
TL;DR: We benchmark DuckDB on a 2012 MacBook Pro to decide: did we lose a decade chasing distributed architectures for data analytics?
Much has been said, not least by ourselves, about how data is actually not that "Big" and how the speed of hardware innovation is outpacing the growth of useful datasets. We may have gone so far as to predict a data singularity in the near future, where 99% of useful datasets can be comfortably queried on a single node. As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9th percentile reads less than 300 GB. So the singularity might be closer than we think.
But we started wondering: when did this development really start? When did personal computers like the ubiquitous MacBook Pro, usually condemned to running Chrome, become the data processing powerhouses that they really are today?
Let's turn our attention to the 2012 Retina MacBook Pro, a computer many people (myself included) bought at the time because of its gorgeous "Retina" display. Millions were sold. Despite being unemployed at the time, I had even splurged on the 16 GB RAM upgrade. But there was another, often-forgotten revolutionary change in this machine: it was the first MacBook with a built-in Solid-State Disk (SSD) and a competitive 4-core 2.6 GHz "Core i7" CPU. It's funny to watch the announcement again, where they do stress the performance aspect of the "all-flash architecture" as well.
Side note: the MacBook Air was actually the first MacBook with an (optional) built-in SSD, back in 2008. But it did not have the CPU firepower of the Pro, sadly.
Coincidentally, I still have this laptop in the DuckDB Labs office; it is currently used by my kids to type their names in a massive font size or to watch Bluey on YouTube when they're around. But can this relic still run modern-day DuckDB? How will its performance compare to modern MacBooks? And could we have had the data revolution that we are seeing now already back in 2012? Let's find out!
Software
First, what about the operating system? In order to make the comparison fair(er) across the decades, we actually downgraded the operating system on the Retina to OS X 10.8.5 "Mountain Lion", the version that shipped just a few weeks after the laptop itself in July 2012. Even though the Retina can actually run 10.15 (Catalina), we felt a true 2012 comparison should also use an operating system from the era. Below is a screenshot of the user interface, for those of us who sometimes feel a little old.
Moving on to DuckDB itself: here at DuckDB we are more than a little religious about portability and dependencies – or rather, the lack thereof. This means that very little had to happen to make DuckDB run on the ancient Mountain Lion: the stock DuckDB binary is built by default with backwards compatibility down to macOS 11.0 (Big Sur), but simply changing that flag and recompiling turned out to be enough to make DuckDB 1.2.2 run on Mountain Lion. We would have loved to also use a 2012 compiler to build DuckDB, but, alas, C++11 was unsurprisingly simply too new in 2012 to be fully supported by compilers. Either way, the binary runs fine and could also have been produced by working around the compiler bugs. Or we could have just hand-coded assembly like others have done.
Benchmarks
But we're not interested in synthetic CPU scores, we're interested in synthetic SQL scores instead! To see how the old machine holds up when performing serious data crunching, we used the by-now rather tired but well-known TPC-H benchmark at scale factor 1000. This means that the two main tables, lineitem and orders, contain 6 billion and 1.5 billion rows, respectively. When stored as a DuckDB database, the database has a size of ca. 265 GB.
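As a side note for readers who want to recreate the setup: DuckDB's bundled tpch extension can generate this database in-process. The Python sketch below is a minimal, assumed workflow (the post does not say exactly how its database file was produced); the file name is illustrative, and at scale factor 1000 generation needs hundreds of GB of disk and a lot of time, so validate with a small scale factor first.

```python
import duckdb

# Create (or open) a persistent DuckDB database file.
con = duckdb.connect("tpch-sf1000.db")

# The tpch extension ships with DuckDB and can generate the benchmark tables.
con.execute("INSTALL tpch")
con.execute("LOAD tpch")

# sf=1000 yields the ~265 GB database discussed above; try sf=1 first.
con.execute("CALL dbgen(sf=1000)")

# Sanity check: lineitem should contain ca. 6 billion rows at sf=1000.
print(con.execute("SELECT count(*) FROM lineitem").fetchone())
```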
From the audited results on the TPC website, we can see that running the benchmark at this scale factor on a single node seems to require hardware costing hundreds of thousands of dollars.
We ran each of the 22 benchmark queries five times and took the median runtime to remove noise. However, because the amount of RAM (16 GB) is far smaller than the database size (265 GB), no significant amount of the input data can be cached in the buffer manager, so these are not really what people sometimes call "hot" runs.
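The actual scripts are linked below under Reproducibility; purely as an illustration, a measurement loop of roughly this shape captures the methodology. It is a sketch, not the post's harness: it assumes the tpch extension is available, whose `PRAGMA tpch(N)` runs TPC-H query N.

```python
import statistics
import time

import duckdb

con = duckdb.connect("tpch-sf1000.db", read_only=True)
con.execute("LOAD tpch")  # provides PRAGMA tpch(N): runs TPC-H query N

for q in range(1, 23):      # the 22 TPC-H queries
    timings = []
    for _ in range(5):      # five repetitions per query
        start = time.perf_counter()
        con.execute(f"PRAGMA tpch({q})").fetchall()
        timings.append(time.perf_counter() - start)
    # the median removes noise; with 16 GB of RAM against a 265 GB database,
    # repeated runs stay effectively "cold" anyway
    print(f"query {q}: {statistics.median(timings):.1f} s")
```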
Below are the per-query results in seconds:
| TPC-H query | time (s) |
|------------:|---------:|
| 1 | 142.2 |
| 2 | 23.2 |
| 3 | 262.7 |
| 4 | 167.5 |
| 5 | 185.9 |
| 6 | 127.7 |
| 7 | 278.3 |
| 8 | 248.4 |
| 9 | 675.0 |
| 10 | 1266.1 |
| 11 | 33.4 |
| 12 | 161.7 |
| 13 | 384.7 |
| 14 | 215.9 |
| 15 | 197.6 |
| 16 | 100.7 |
| 17 | 243.7 |
| 18 | 2076.1 |
| 19 | 283.9 |
| 20 | 200.1 |
| 21 | 1011.9 |
| 22 | 57.7 |
But what do those cold numbers actually mean? The hidden sensation is that we have numbers at all: this old computer could actually complete all benchmark queries using DuckDB! If we look at the times a bit more closely, we see the queries take anywhere between half a minute and half an hour. Those are not unreasonable waiting times for analytical queries on that sort of data in any way. Heck, back in 2012 you would have waited far longer just for Hadoop YARN to pick up your job in the first place, only for it to spew stack traces at you at some point.
Improvements
But how do those results stack up against a modern MacBook? As a comparison point, we used a modern ARM-based M3 Max MacBook Pro, which happened to be sitting on the same desk. Between them, the two MacBooks represent more than a decade of hardware development.
Looking at GeekBench 5 benchmark scores alone, we see a ca. 7× difference in raw CPU speed when using all cores, and a ca. 3× difference in single-core speed. Of course, there are also big differences in RAM and SSD speeds. Funnily enough, the display size and resolution are almost unchanged.
Here are the results side by side:
| TPC-H query | time 2012 (s) | time 2023 (s) | speedup |
|------------:|--------------:|--------------:|--------:|
| 1 | 142.2 | 19.6 | 7.26 |
| 2 | 23.2 | 2.0 | 11.60 |
| 3 | 262.7 | 21.8 | 12.05 |
| 4 | 167.5 | 11.1 | 15.09 |
| 5 | 185.9 | 15.5 | 11.99 |
| 6 | 127.7 | 6.6 | 19.35 |
| 7 | 278.3 | 14.9 | 18.68 |
| 8 | 248.4 | 14.5 | 17.13 |
| 9 | 675.0 | 33.3 | 20.27 |
| 10 | 1266.1 | 23.6 | 53.65 |
| 11 | 33.4 | 2.2 | 15.18 |
| 12 | 161.7 | 10.1 | 16.01 |
| 13 | 384.7 | 24.4 | 15.77 |
| 14 | 215.9 | 9.2 | 23.47 |
| 15 | 197.6 | 8.2 | 24.10 |
| 16 | 100.7 | 4.1 | 24.56 |
| 17 | 243.7 | 15.3 | 15.93 |
| 18 | 2076.1 | 47.6 | 43.62 |
| 19 | 283.9 | 23.1 | 12.29 |
| 20 | 200.1 | 10.9 | 18.36 |
| 21 | 1011.9 | 47.8 | 21.17 |
| 22 | 57.7 | 4.3 | 13.42 |
We do see significant speedups, ranging from 7× up to as much as 53×. The geometric mean of the timings improved from ca. 218 seconds to ca. 12 seconds, a roughly 18× improvement.
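For the record, the geometric mean here is the 22nd root of the product of the 22 per-query times, which weights every query equally regardless of its absolute runtime. Recomputing it from the rounded table values above:

```python
import math

# Median per-query runtimes in seconds, copied from the table above.
t_2012 = [142.2, 23.2, 262.7, 167.5, 185.9, 127.7, 278.3, 248.4, 675.0,
          1266.1, 33.4, 161.7, 384.7, 215.9, 197.6, 100.7, 243.7, 2076.1,
          283.9, 200.1, 1011.9, 57.7]
t_2023 = [19.6, 2.0, 21.8, 11.1, 15.5, 6.6, 14.9, 14.5, 33.3, 23.6, 2.2,
          10.1, 24.4, 9.2, 8.2, 4.1, 15.3, 47.6, 23.1, 10.9, 47.8, 4.3]

def geomean(xs):
    # exp(mean(log x)) avoids overflow from multiplying 22 large values
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Prints ca. 218.9 and ca. 12.3, i.e. a ca. 17.7x overall speedup.
print(geomean(t_2012), geomean(t_2023), geomean(t_2012) / geomean(t_2023))
```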
Reproducibility
The binary, scripts, queries, and results are available on GitHub for inspection. We also made the TPC-H SF1000 database file available for download, so you don't have to generate it yourself. But be warned: it's a large file.
- https://github.com/hannes/old-macbook-tpch
- http://blobs.duckdb.org/data/tpch-sf1000.db
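If you just want to poke at the data before committing to the full download, recent DuckDB versions can also attach a remote database file read-only via the httpfs extension. This shortcut is our suggestion, not part of the post's setup, and every byte a query touches is still fetched over HTTP:

```python
import duckdb

con = duckdb.connect()

# httpfs lets DuckDB read files, including database files, over HTTP(S).
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Attach the published SF1000 database read-only, without a local copy.
con.execute(
    "ATTACH 'http://blobs.duckdb.org/data/tpch-sf1000.db' AS tpch (READ_ONLY)"
)

# Queries fetch only the parts of the file they need, over the network.
print(con.execute("SELECT count(*) FROM tpch.lineitem").fetchone())
```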
Discussion
We have seen how the decade-old Retina MacBook Pro was able to complete a complex analytical benchmark. A newer laptop was able to improve significantly on those times. But absolute speedup numbers are a bit beside the point here: the difference is purely quantitative, not qualitative.
From a user perspective, it matters much more that those queries complete in somewhat reasonable time than whether doing so took 10 or 100 seconds. We can tackle almost the same kind of data problems with both laptops; we just have to be willing to wait a little longer. This is especially true given DuckDB's out-of-core capability, which allows it to spill query intermediates to disk if required.
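To make that concrete, the knobs involved are ordinary DuckDB settings (memory_limit and temp_directory are real options; the values below are illustrative assumptions for a 16 GB machine, not the post's configuration):

```python
import duckdb

con = duckdb.connect("tpch-sf1000.db", read_only=True)

# Cap DuckDB's memory use below the machine's 16 GB of RAM...
con.execute("SET memory_limit = '12GB'")
# ...and give it a place to spill query intermediates that exceed the cap.
con.execute("SET temp_directory = '/tmp/duckdb_spill'")

# A grouped aggregation over 6 billion rows can now build hash tables
# larger than RAM and still complete, just more slowly.
print(con.execute("""
    SELECT l_orderkey, sum(l_quantity) AS total_quantity
    FROM lineitem
    GROUP BY l_orderkey
    ORDER BY total_quantity DESC
    LIMIT 10
""").fetchall())
```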
What is perhaps more interesting is that back in 2012, it would have been completely feasible to have a single-node SQL engine like DuckDB that could run complex analytical SQL queries against a database of 6 billion rows in manageable time – and we did not even have to immerse it in dry ice this time.
History is full of "what if"s. What if something like DuckDB had existed in 2012? The main ingredients were there: vectorized query processing had already been invented in 2005. Would the now somewhat-silly-looking move to distributed systems for data analysis ever have happened? The dataset size of our benchmark database is awfully close to the 99.9th percentile of input data volume for analytical queries in 2024. And while the Retina MacBook Pro was a high-end machine in 2012, by 2014 many other vendors had shifted to offering laptops with built-in SSD storage, and larger amounts of memory became more widespread.
So, yes, we really did lose a full decade.